A significant percentage of location data in the mobile ad ecosystem (anywhere from 30% to 70%) is of insufficient quality for appropriate use in location-based mobile ad targeting, measurement, or analytics. In a previous post, Validating Mobile Ad Location Data at Factual, we describe the different reasons for this and the variety of methods we employ to pre-process location data. In this post, the first of many that explore specific pathologies of low-quality location data, we’ll do a deep dive into one specific driver of low-quality location data: app permissions.
The Android mobile operating system offers two permissions for location tracking: “coarse” (ACCESS_COARSE_LOCATION) and “fine” (ACCESS_FINE_LOCATION). With the coarse permission, according to Google, the OS will return a location with accuracy approximately equivalent to a city block [1] (although we found the error to be 2,000 meters in most cases).
The major problem is that this nuance gets lost in the mobile ad ecosystem. The data we see in mobile exchanges from apps with coarse location permissions, on its face, looks like the data from apps with fine location permissions. The data comes through with up to 6 decimal places of precision, so it looks highly precise [2], and the type flag in the geo object is set to 1 (“GPS/Location Services” [3]). That flag is true, but it obscures the fact that Android has decreased the accuracy of these points.
Factual employs automated methods to detect and filter out this kind of activity. The system is built on a statistical model that learns which points are over-represented based on all points in the system. We also maintain a blacklist of apps that have significant percentages of blacklisted traffic.
Dive into the Data
Note: For an in-depth, technical look into the research process, please refer to the lab notes.
A specific app publisher reported to us that our Location Validation Stack was blacklisting a significant portion of their data, so we decided to investigate. We noticed that a lot of the latitudes being passed had repeated groups of 3 digits.
Purely by chance one would expect to see this for one out of every thousand data points (0.1%), but we were seeing it in more than 5% of the unvalidated inputs. Although we were already detecting these as invalid, we wanted to investigate more carefully to understand the underlying cause.
Step 1: Characterizing the pathology
Since we were seeing repeated groups of 3 digits, I represented each latitude in the form X.YYYZZZ, where YYY are the first three decimal digits and ZZZ are the next three. For example, the latitude 34.194783 has X = 34, YYY = 194, and ZZZ = 783. I then created a histogram of the difference between YYY and ZZZ. For comparison, I also created a histogram of a difference distribution of uniformly random values (which is more or less what we’d expect if the latitudes were unmodified).
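The decomposition above can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the function names `digit_groups` and `delta_histogram` are hypothetical, and the sketch assumes latitudes arrive as strings with six decimal places.

```python
from collections import Counter
from decimal import Decimal

def digit_groups(latitude: str) -> tuple[int, int]:
    """Split a latitude written as X.YYYZZZ into the integers (YYY, ZZZ)."""
    # Work in exact integer micro-degrees to avoid float rounding.
    micro = int((abs(Decimal(latitude)) * 1_000_000).to_integral_value())
    frac = micro % 1_000_000
    return frac // 1000, frac % 1000

def delta_histogram(latitudes: list[str]) -> Counter:
    """Count how often each difference YYY - ZZZ occurs."""
    return Counter(yyy - zzz for yyy, zzz in map(digit_groups, latitudes))

print(digit_groups("34.194783"))  # (194, 783)
print(delta_histogram(["34.194783", "40.306306", "41.315314"]))
```

Running `delta_histogram` over unmodified latitudes should give a broad spread of differences, while affected data piles up around zero.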
Clearly the observed latitude data is not behaving as one would expect, and we’re seeing an abnormal concentration of values with deltas of 0, 1, and 2.
One obvious question is whether the longitude values exhibit similar behavior. Interestingly, they don’t, as you can also see in the chart above.
Step 2: Measuring the scope of the problem
While the investigation started because one publisher inquired as to why their data was being blacklisted, we quickly discovered that the problem spanned many apps, Android devices, and geographical regions.
We found that about one in three apps had a statistically significant error rate. When we looked at device types, we found that Apple devices don’t seem to have the problem, and some Android devices were OK as well.
Table: number of data points and fraction of bogus points by device model (rows: iPhone 5S (GSM), iPhone 5 (GSM+CDMA), iPhone 6 Plus, iPhone 5C (GSM), iPhone 5S (GSM+CDMA)).
Lastly, when plotted on a map, the coordinates look reasonable and come from all around the world.
Step 3: Identifying Possible Root Causes
The following is a latitude correlation graph, where the X axis is the first three decimal digits (YYY) and the Y axis is the next three (ZZZ), compared to the same graph for uniformly random values:
There is a clear diagonal line through the center of the latitude correlation graph, illustrating the high frequency of similar 3 digit pairs. There’s an interesting pattern going on though, which becomes more apparent if you zoom in. It looks like the covariant digit groups have even spacing.
Zooming in some more…
The X coordinates of dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. This suggests that the error is caused by some kind of quantization, since no real-world geographic feature has this much regularity. More specifically, it suggests that the latitudes are being rounded to the nearest 0.009009 (n.306306 - n.297297 = 0.009009 for any integer part n).
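To see why rounding to a 0.009009 grid would produce repeated digit groups spaced nine apart, consider the small sketch below. It assumes, purely for illustration, a grid anchored at zero and latitudes below one degree, and it works in integer micro-degrees; `GRID_MICRO`, `snap_to_grid`, and `split_decimals` are hypothetical names.

```python
GRID_MICRO = 9009  # 0.009009 degrees expressed in micro-degrees (9 * 1001)

def snap_to_grid(micro_degrees: int, step: int = GRID_MICRO) -> int:
    """Round a non-negative micro-degree value to the nearest multiple of step."""
    return ((micro_degrees + step // 2) // step) * step

def split_decimals(micro_degrees: int) -> tuple[int, int]:
    """Return (YYY, ZZZ) for a value whose decimals are .YYYZZZ."""
    frac = micro_degrees % 1_000_000
    return frac // 1000, frac % 1000

# Five arbitrary latitudes near 0.3 degrees, snapped to the grid.
for raw in (295_000, 304_100, 317_000, 322_222, 335_999):
    print(split_decimals(snap_to_grid(raw)))
# (297, 297)
# (306, 306)
# (315, 315)
# (324, 324)
# (333, 333)
```

Because 9009 = 9 × 1001, the k-th grid point below one degree has both digit groups equal to 9k, which is exactly the diagonal with spacing nine seen in the graph.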
Before we conclude that the latitudes are being rounded to the nearest 0.009009, we need to make sure that what we’re seeing isn’t the overlap of multiple signals. (To picture this, suppose you rounded a bunch of numbers to the nearest ⅜: the results 0.375, 0.75, 1.125, 1.5, and so on have decimals .375, .75, .125, .5, so looking just at the decimals you’d think the rounding was to the nearest ⅛.) We can check this with a Fourier transform to recover the base frequencies, which in this case tells us that the actual quantization is happening at intervals of 0.018018 instead of 0.009009.
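The exact form of the Fourier analysis is described in the lab notes rather than here. As a rough stand-in, one can evaluate the Fourier sum of the data at a handful of candidate spacings and keep the largest spacing that the points still line up with. The sketch below does this on synthetic latitudes; the names `grid_coherence` and `base_step`, the candidate list, and the 0.99 threshold are all assumptions made for illustration.

```python
import cmath

def grid_coherence(values: list[float], step: float) -> float:
    """Magnitude of the mean phasor exp(2*pi*i*v/step).

    1.0 means every value sits on a grid with that spacing;
    values near 0 mean the data does not align with it.
    """
    total = sum(cmath.exp(2j * cmath.pi * v / step) for v in values)
    return abs(total) / len(values)

def base_step(values: list[float], candidates: list[float],
              threshold: float = 0.99) -> float:
    """Largest candidate spacing that the values still line up with."""
    return max(s for s in candidates if grid_coherence(values, s) >= threshold)

# Synthetic latitudes quantized to 0.018018, reported with six decimals.
lats = [round(k * 0.018018, 6) for k in range(2000)]
candidates = [0.009009, 0.018018, 0.027027, 0.036036]
for s in candidates:
    print(s, round(grid_coherence(lats, s), 3))
print(base_step(lats, candidates))  # 0.018018
```

Note that 0.009009 also scores close to 1, because any grid of 0.018018 is contained in a grid of half that spacing. That is why the sketch takes the largest coherent spacing rather than the first one that fits.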
We see 0.009009 because the pattern is doubled on an even/odd basis:
The Underlying Problem
All of the apps with a high error rate requested coarse network-based location permissions. The Android source code includes a class called “LocationFudger”, specifically designed to quantize locations for user privacy:
/**
 * Contains the logic to obfuscate (fudge) locations for coarse applications.
 *
 * <p>The goal is just to prevent applications with only
 * the coarse location permission from receiving a fine location.
 */
The rationale for this is that if the app doesn’t have permission to get fine-grained coordinates, then all fine-grained location sources (e.g., GPS) are deliberately quantized to prevent data leakage. The reported locations aren’t completely wrong, just up to 2,000 meters away from the actual location of the device.
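The 2,000-meter figure and the 0.018018 interval found earlier are related by simple arithmetic. The sketch below is not the Android implementation. It assumes plain rounding to a 2,000-meter grid and roughly 111,000 meters per degree of latitude, and the names `snap_latitude` and `displacement_meters` are hypothetical.

```python
GRID_METERS = 2_000              # error scale quoted in the text
METERS_PER_DEGREE_LAT = 111_000  # assumption: rough length of one degree of latitude
GRID_DEGREES = GRID_METERS / METERS_PER_DEGREE_LAT  # 0.018018018...

def snap_latitude(lat: float) -> float:
    """Round a latitude to the nearest grid line, reported with six decimals."""
    return round(round(lat / GRID_DEGREES) * GRID_DEGREES, 6)

def displacement_meters(lat: float) -> float:
    """North-south distance between a latitude and its snapped value."""
    return abs(lat - snap_latitude(lat)) * METERS_PER_DEGREE_LAT

print(round(GRID_DEGREES, 6))    # 0.018018
print(snap_latitude(34.194783))  # 34.198198
print(round(displacement_meters(34.194783)))
```

Under these assumptions a 2,000-meter grid corresponds to 2/111 of a degree, and the snapped latitude shows the same repeated three-digit groups that started the investigation.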
Our standard model was already identifying the bad data, but it’s good to have classifiers specifically designed to detect known errors. While one can use other techniques to flag apps with insufficient permissions (e.g., crawling the app store), we’ve found that these data-driven models are lower-latency and more reliable, as they have fewer external dependencies. Another advantage of a purely data-driven approach is that it allows us to safely process data whose source is unknown or incorrectly indicated.
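As a minimal sketch of what a purpose-built, data-driven check could look like (this is not our production classifier), one can compare how often a source sends latitudes with identical digit groups against the 0.1% rate expected by chance. The normal approximation, the threshold of 5, and the names `repeat_rate_z_score` and `looks_coarse` are assumptions made for illustration.

```python
import math

CHANCE_RATE = 0.001  # probability that YYY == ZZZ for unmodified latitudes

def repeat_rate_z_score(repeated: int, total: int,
                        p0: float = CHANCE_RATE) -> float:
    """Z-score of the observed repeat count against the chance rate."""
    expected = total * p0
    std = math.sqrt(total * p0 * (1 - p0))
    return (repeated - expected) / std

def looks_coarse(repeated: int, total: int, z_threshold: float = 5.0) -> bool:
    """Flag a source whose repeat rate is far above chance."""
    return total > 0 and repeat_rate_z_score(repeated, total) > z_threshold

print(looks_coarse(repeated=500, total=10_000))  # True: 5% repeats
print(looks_coarse(repeated=12, total=10_000))   # False: close to 0.1%
```

A check like this needs only the coordinates themselves, so it works even when the app identifier is missing or wrong.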
– Spencer Tipping, Software Engineer and Vikas Gupta, Director of Marketing
Spencer Tipping is a software engineer at Factual. Spencer joined Factual in 2012 and is responsible for quantitative research, fraud detection, and last-resort problem solving. His favorite word is “arbitrage”, and he likes to think he makes Factual run more efficiently. Prior to joining Factual, Spencer worked for a number of startups, designed programming languages, and wrote a lot of self-modifying code.
Vikas Gupta is Director of Marketing for Factual. Vikas joined Factual in 2011 and runs Factual’s marketing department. He is an active participant in trade organizations, including co-chairing the MMA’s Location Committee and co-chairing the IAB Mobile Center’s Location Data Working Group. Prior to Factual, Vikas was at LoopNet Inc. for 3 years where he did marketing strategy and analytics for both LoopNet’s core commercial real estate business and their business for sale subsidiary BizBuySell. Prior to LoopNet he was a business analyst at McKinsey & Company where he served clients in financial services, airlines, and chemicals in a variety of capacities including strategy, operations, and post-merger management.