Privacy-Label Web Crawler at Scale
A Python/GCP crawling pipeline that extracted and processed metadata from 3M+ Google Play apps for privacy research.
Research Assistant
- Apps processed: 3M+
In CMU's Mobility Privacy & Security Lab, I built the data infrastructure behind a large-scale study of app privacy practices.
Problem
App privacy labels are meant to make data collection practices visible to users: what an app collects, whether data is encrypted, whether deletion is supported, and how that varies across categories. For researchers, the challenge is that those disclosures live inside millions of app-store pages, not in one clean table.
Studying privacy trends across the app ecosystem meant collecting structured metadata from millions of Google Play listings reliably and at scale, despite inconsistent page structures, missing fields, and a mix of static and dynamic HTML containers.
Crawling pipeline
Turning app-store pages into research-ready privacy data
From page text to schema
The useful output was not a page scrape. It was normalized metadata.
The crawler had to preserve the semantics researchers cared about: what data an app collects, whether it is encrypted, whether deletion is supported, and how those disclosures vary by app category.
- App identity: name, package, category
- Collected data: location, activity, personal info
- Security claims: encrypted, deletion, policy flags
- Research features: clean rows for trend analysis
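The four groups above map naturally onto a single record type. The following is a minimal sketch of what one normalized row might look like. The class name `PrivacyLabelRecord` and every field name are illustrative assumptions, not the study's actual schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class PrivacyLabelRecord:
    """One normalized row per app listing (hypothetical field names)."""

    # App identity
    package: str
    name: Optional[str] = None
    category: Optional[str] = None
    # Collected data types, e.g. ["Location", "Personal info"]
    collected_data: List[str] = field(default_factory=list)
    # Security claims; None means the listing did not state the claim
    encrypted_in_transit: Optional[bool] = None
    deletion_supported: Optional[bool] = None

    def to_row(self) -> Dict[str, object]:
        """Flatten to a dict suitable for a CSV or table row."""
        row = asdict(self)
        # Join the list into one sorted string so rows compare consistently.
        row["collected_data"] = ";".join(sorted(self.collected_data))
        return row


record = PrivacyLabelRecord(
    package="com.example.app",
    name="Example App",
    category="Tools",
    collected_data=["Location", "App activity"],
    encrypted_in_transit=True,
)
print(record.to_row())
```

Keeping "not stated" (`None`) distinct from "stated as false" matters for trend analysis, since a missing field on a listing is not the same as a negative disclosure.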
Approach
I built a Python/GCP web-crawling pipeline to extract and process metadata from 3M+ Google Play apps. The crawler started from app IDs and listing URLs, fetched the corresponding pages, isolated the privacy-label section, and converted the page text into structured fields suitable for research.
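The stages described here (ID to URL, fetch, isolate, convert) can be sketched roughly as follows. This is not the project's code: the function names, the `"Data safety"` marker text, and the stub fetcher are all assumptions made for illustration, and the fetch step is injected so the sketch runs without network access.

```python
import re
from typing import Callable, Dict, Optional
from urllib.parse import urlencode

PLAY_DETAILS_URL = "https://play.google.com/store/apps/details"


def listing_url(package: str) -> str:
    """Build the listing URL for a package ID."""
    return f"{PLAY_DETAILS_URL}?{urlencode({'id': package})}"


def isolate_section(html: str, marker: str = "Data safety") -> Optional[str]:
    """Return page content from the marker onward (the marker text is assumed)."""
    start = html.find(marker)
    return html[start:] if start != -1 else None


def section_text(section_html: str) -> str:
    """Drop tags and collapse whitespace so later parsing sees plain text."""
    no_tags = re.sub(r"<[^>]+>", " ", section_html)
    return re.sub(r"\s+", " ", no_tags).strip()


def crawl_one(package: str, fetch: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Run one app through fetch, isolate, and convert.

    A missing privacy section yields None rather than an error.
    """
    html = fetch(listing_url(package))
    section = isolate_section(html)
    return {
        "package": package,
        "privacy_text": section_text(section) if section is not None else None,
    }


# Stub fetcher so the sketch runs offline; a real crawl would issue an HTTP GET here.
def fake_fetch(url: str) -> str:
    return "<html><h1>Example</h1><h2>Data safety</h2><p>Data is encrypted\n in transit</p></html>"


print(crawl_one("com.example.app", fake_fetch))
```

Passing the fetcher in as a parameter keeps the parsing logic testable against saved pages, which is useful when the live page structure changes between crawls.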
Some fields were straightforward to parse with libraries like Beautiful Soup. Others were buried in more complex containers, so the crawler had to search HTML strings for stable patterns and normalize the extracted values into a consistent schema.
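As a rough illustration of the string-search fallback, the sketch below looks for phrase patterns in raw HTML and maps them to normalized values. It uses only the standard library and covers just the fallback path, not the Beautiful Soup path. The phrases, the field names, and `normalize_claims` itself are assumptions; real listings may word these claims differently.

```python
import re
from typing import Dict, Optional

# (positive, negative) patterns per field. The wording is assumed for illustration.
CLAIM_PATTERNS = {
    "encrypted_in_transit": (
        re.compile(r"data\s+is\s+encrypted", re.I),
        re.compile(r"data\s+is(?:\s+not|n't)\s+encrypted", re.I),
    ),
    "deletion_supported": (
        re.compile(r"you\s+can\s+request\s+that\s+data\s+be\s+deleted", re.I),
        re.compile(r"data\s+can(?:'t|not)\s+be\s+deleted", re.I),
    ),
}


def normalize_claims(html: str) -> Dict[str, Optional[bool]]:
    """Map free-text claims to True, False, or None (None means not stated)."""
    # Strip tags and collapse whitespace so a phrase still matches when
    # markup or line breaks are interleaved with its words.
    text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html))
    claims: Dict[str, Optional[bool]] = {}
    for field_name, (positive, negative) in CLAIM_PATTERNS.items():
        # Check the negative form first so it is never read as a positive claim.
        if negative.search(text):
            claims[field_name] = False
        elif positive.search(text):
            claims[field_name] = True
        else:
            claims[field_name] = None
    return claims


sample = (
    '<div class="x1"><span>Data <b>is</b>\n encrypted in transit</span></div>'
    "<div>Data can't be deleted</div>"
)
print(normalize_claims(sample))
```

Matching on visible phrases rather than on class names or element nesting is one way to get "stable patterns", since generated class names tend to change more often than user-facing wording.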
The pipeline was designed to scale gradually: validate the parser on a small batch, expand to larger crawls on Google Cloud, store outputs safely, and produce a clean dataset researchers could query for trend analysis.
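One way to express "validate small, then expand" in code is to gate the full crawl on a pilot batch. The sketch below is an assumption about how such a gate could look, not the pipeline's actual logic: `staged_crawl`, the fill-rate threshold, and the `category` field are hypothetical, and the crawl function is a stub.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fill_rate(records: List[Record], field_name: str) -> float:
    """Fraction of records in which the field was actually extracted."""
    if not records:
        return 0.0
    return sum(r.get(field_name) is not None for r in records) / len(records)


def staged_crawl(
    app_ids: Iterable[str],
    crawl_one: Callable[[str], Record],
    pilot_size: int = 100,
    batch_size: int = 1000,
    min_fill: float = 0.9,
    required: Sequence[str] = ("category",),
) -> Iterator[List[Record]]:
    """Crawl a pilot batch first; continue only if required fields parse often enough."""
    ids = iter(app_ids)
    pilot = [crawl_one(app_id) for app_id in islice(ids, pilot_size)]
    for name in required:
        rate = fill_rate(pilot, name)
        if rate < min_fill:
            raise RuntimeError(
                f"pilot fill rate for {name!r} is {rate:.0%}; fix the parser before scaling"
            )
    yield pilot
    # The pilot passed, so process the remaining IDs in larger batches.
    for chunk in batched(ids, batch_size):
        yield [crawl_one(app_id) for app_id in chunk]


def fake_crawl(app_id: str) -> Record:
    """Stub standing in for a real per-app crawl."""
    return {"package": app_id, "category": "Tools"}


ids = [f"com.example.app{i}" for i in range(25)]
batches = list(staged_crawl(ids, fake_crawl, pilot_size=5, batch_size=10))
print([len(b) for b in batches])
```

Each yielded batch is a natural point to write results to durable storage before moving on, so a failure partway through a large crawl does not lose earlier work.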
Impact
The pipeline turned a sprawling, messy source into research-ready data, enabling analysis of privacy practices across the Play Store at a scale that manual collection could never reach.
The work gave me first-hand experience with cloud experiments, web backends, dynamic page structure, and the practical problem of turning messy internet-scale data into something researchers can trust. The code remains proprietary to Carnegie Mellon.