Data science is the backbone of numerous ad-tech firms competing in an increasingly technology-driven environment. Explaining what data scientists do is difficult, however, given that their work is often described as a mix of art and science and varies with each company.
AdExchanger asked data scientists from Decide.com, a price research firm acquired by eBay, and location analytics company Placed to describe a problem they’ve solved in a snapshot look at the roles data scientists play. Read on for their (lightly edited) responses.
Who: David Hsu, principal engineer at eBay
The challenge: Identifying variants in similar products
“At Decide.com, I led a team of data scientists that worked on a broad range of projects, including a prediction algorithm for how product prices will change in the future, a smart product-rating system, a system for finding and associating product news from the Web with relevant products, and internal systems (such as product categorization) for organizing our product catalog.
One of the more subtle problems that the data-science team resolved [was] determining which products are variants of each other. In the product catalog of most ecommerce sites, many items can be considered variants of each other, such as different-colored versions of baby seats or laptop products that vary in the amount of memory installed or size of the hard drive.
Detecting these variants can improve the shopping experience in multiple ways. For example, within the search results page, variants of the same product can be collapsed together to reduce clutter and increase the diversity of products shown. In addition, product reviews for different variants may be pooled together so that variants receive the same overall rating. Two refrigerators that only differ in color, for example, should not have different review ratings.
Thinking through the product aspects of this problem, two things became apparent that influenced how we tackled the problem:
a. There are no hard and fast rules as to what constitutes a variant. This differs by category and can change over time.
b. The cost of false positives (saying two products are variants of each other when they aren’t) is much worse than the cost of false negatives (missing two potential variants).
Consequently, we decided to pursue an interactive machine-learning approach [along with manually verifying the algorithms] to solve the problem. This approach had the advantage of both accuracy and scalability across categories. Building this system required both an automated algorithm for the first step [the automated clustering] and an interactive dashboard that data specialists could use to perform the second step [the manual verification] effectively.
I won’t go into the details of the learning algorithm that we used for the automated clustering, but we ended up using a clustering approach known as hierarchical agglomerative clustering that can find clusters of similar products as long as there is a way to generate a numeric score for how similar two products are. Our similarity score was generated via a machine-learned classifier that was trained on pairs of known variants and known dissimilar products.
The variant relationships our system produced ended up improving the experience of shopping for products on the Decide.com website in a few ways. First, we used these generated variant relationships to ensure that product ratings that Decide.com generated would be common across all variants of the same product. Second, we altered the product search results page to collapse variants of the same products into a single search result entry.
Before, if someone searched our site for the ‘Canon DSLR camera,’ we might return hundreds of nearly identical cameras that differed only in which additional accessories were included. Afterwards, we would return one entry for each major product line offered by Canon, along with the option of drilling down on the different variants of each product line.
Finally, the variant relationships that the system discovered powered the ‘other variants’ section of the product detail pages for the Decide.com website. For example, if a customer was looking at our product page for a Canon EOS Rebel T3i SLR camera, they could navigate to additional camera bundles based on the Rebel T3i from the variant pop-up on the product page.”
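For readers who want a concrete picture of the clustering step Hsu describes, here is a minimal Python sketch. It is an illustration under stated assumptions, not Decide.com's code: the function title_similarity is a hypothetical stand-in for the machine-learned pairwise classifier, the product titles are invented, and the choice of complete linkage with a 0.4 threshold is an assumption made to reflect the point that false positives cost more than false negatives.

```python
from itertools import combinations


def title_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for the learned classifier's score:
    Jaccard overlap of the lowercase tokens in two product titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def cluster_variants(products, similarity, threshold):
    """Hierarchical agglomerative clustering with complete linkage.

    Starts with one cluster per product, then repeatedly merges the two
    clusters whose *least* similar cross pair scores highest. Merging
    stops once that score drops below `threshold`, so every pair inside
    a cluster scores at least `threshold` (a conservative rule that
    favors missed variants over wrong ones).
    """
    clusters = [[p] for p in products]

    def linkage(c1, c2):
        return min(similarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        # combinations() guarantees i < j, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented catalog entries, for illustration only.
products = [
    "Canon EOS Rebel T3i body only",
    "Canon EOS Rebel T3i with 18-55mm lens",
    "Canon EOS Rebel T3i with tripod bundle",
    "Graco baby seat red",
    "Graco baby seat blue",
]
result = cluster_variants(products, title_similarity, threshold=0.4)
for cluster in result:
    print(cluster)
```

In a workflow like the one described, clusters proposed this way would then go to a data specialist for verification rather than straight to the site.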
Who: Weilie Yi, principal scientist at Placed, Inc.
The challenge: Bringing Web-like consumer measurements to the offline world
“With today’s advancements in mobile location technology, understanding the places people visit in the physical world should seem like an easy feat. But that’s far from the case. In fact, Placed found that 90% of the time, assigning the closest place to a latitude and longitude point measured through cellular GPS assigned a visit to the wrong business. Compared to website analytics, location analytics had a long way to go, and arguably a much more complex measurement problem to solve.
At Placed, the foundation of our business is the ability to accurately measure the places people go in the offline world by leveraging the billions of location data points we measure from the world’s largest opt-in location smartphone panel. In order to accurately identify the places people visit offline, our data science team developed an algorithm known as the Placed Inference Engine, which measures and interprets location data.
Before we go deeper into the science, it’s important to understand that people leave a trail of location data throughout their day as they move from point A to point B and so on. In fact, Placed measures on average 1,000 location data points from each panelist per day. Obviously, it’s nearly impossible to visit 1,000 businesses in one day, but these location points are akin to data crumbs that a consumer leaves behind as they move about in the physical world. We built the Placed Inference Engine to stitch together these data crumbs to determine a visit to a business versus a nonvisit (e.g., they walked by, were at a stop light or were at the business next door).
For example, between 9:15 am and 9:28 am on a Monday morning, a location panelist was in the vicinity of a coffee shop, a fast-food restaurant and a clothing store. One would think picking the closest business to the latitude/longitude point would indicate the place the person was actually visiting, but this approach has a number of pitfalls. First of all, we don’t necessarily have the user’s precise location. That requires the phone’s GPS to get optimal reception, which is often unlikely when somebody is buying a coffee. Secondly, the user could be moving around. Imagine a user walks 200 feet from his or her parked car to the coffee shop, passing by a fast-food restaurant. Such a sequence of location data points creates ambiguity, however precise the location data may be. Last, but not least, even if we know a user’s exact location coordinates, we still have to know that there is a coffee shop at that precise location. Unfortunately, there isn’t a single-source point-of-interest (POI) database that has complete, accurate and up-to-date location information.
To solve for this inherent complexity in accurately measuring location, Placed decided to use machine learning by leveraging an enormous volume of location data. Machine learning is a group of techniques that automatically extract relationships from data; in our case, the relationship between user location and places. The algorithm, which we call the Placed Inference Engine, examines multiple signals associated with each place nearby, with the vicinity and with the user. For example, at 9:15 am you are more likely to be at a coffee shop than at a clothing store. But if that particular coffee shop is closed on Mondays, the clothing store becomes a stronger candidate.
Leveraging our scientists’ background in Web search, we took a page from modern search engine history. When you are looking for certain information on the Web, you type in a text query such as ‘Samsung market share’ in Google and receive relevant links, with (hopefully) the most relevant ones rising to the top. Similarly, with our Inference Engine, the query is the geographic coordinate of a location instead of a text string, and the results are a list of businesses the user is most likely to be visiting. Leveraging all of the location information Placed has measured and analyzed, we assign probabilities to each of those business results to determine the place the user was most likely visiting. For example, a user could be visiting the coffee shop with 81% probability, as well as the fast-food restaurant with 19% probability. We send those results to users in the form of survey questions that have provided more than 5 million validation proof points, and then feed these verification points to our machine-learning algorithm to continually improve the accuracy of the probabilities.
This probabilistic modeling is powerful because it allows Placed to apply a statistical probability to each and every visit that a person makes throughout their day. The Inference Engine, a machine-learned algorithm, is the foundation of every product at Placed and has created the ability to … bring offline measurement closer to that of the online world.”
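To make the idea of ranking candidate businesses by probability more concrete, here is a minimal Python sketch. It is not the Placed Inference Engine, whose signals and model are not described in detail here. Everything in it is an assumption for illustration: the Candidate fields, the business names, the MORNING_PRIOR values, the Gaussian proximity weight and its 50-meter gps_sigma_m. It only shows how distance, time of day and opening hours might be combined into scores and normalized into probabilities.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    category: str
    distance_m: float  # distance from the measured point, in meters
    open_now: bool


# Hypothetical priors: relative likelihood of a visit to each category
# around 9 am. A real system would learn these from validated visits.
MORNING_PRIOR = {"coffee shop": 0.6, "fast food": 0.3, "clothing store": 0.1}


def visit_probabilities(candidates, prior, gps_sigma_m=50.0):
    """Score each candidate and normalize the scores into probabilities.

    score = proximity weight * time-of-day prior * open/closed indicator
    """
    scores = {}
    for c in candidates:
        # Closer places score higher; gps_sigma_m models location noise.
        proximity = math.exp(-(c.distance_m ** 2) / (2 * gps_sigma_m ** 2))
        availability = 1.0 if c.open_now else 0.0
        scores[c.name] = proximity * prior.get(c.category, 0.0) * availability
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: s / total for name, s in scores.items()}


# Invented businesses near one measured point; the coffee shop is the
# farthest of the three.
nearby = [
    Candidate("Corner Coffee", "coffee shop", 40.0, True),
    Candidate("Burger Stop", "fast food", 20.0, True),
    Candidate("Thread Clothing", "clothing store", 30.0, True),
]
all_open = visit_probabilities(nearby, MORNING_PRIOR)

# Same scene, but the coffee shop is closed on Mondays.
monday = [
    Candidate(c.name, c.category, c.distance_m, c.category != "coffee shop")
    for c in nearby
]
coffee_closed = visit_probabilities(monday, MORNING_PRIOR)

print(all_open)
print(coffee_closed)
```

With these made-up numbers, the coffee shop ranks first when everything is open even though it is not the closest business, and the clothing store's share rises once the coffee shop is marked closed. Survey answers like the ones described above could serve as the labeled data for replacing the hand-set priors with learned ones.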