CMU Mobility Privacy & Security Lab · Spring 2023

Privacy-Label Web Crawler at Scale

A Python/GCP crawling pipeline that extracted and processed metadata from 3M+ Google Play apps for privacy research.

Research Assistant

Apps processed: 3M+
Python · GCP · Web crawling · Data pipelines

In CMU's Mobility Privacy & Security Lab, I built the data infrastructure behind a large-scale study of app privacy practices.

Problem

App privacy labels are meant to make data collection practices visible to users: what an app collects, whether data is encrypted, whether deletion is supported, and how that varies across categories. For researchers, the challenge is that those disclosures live inside millions of app-store pages, not in one clean table.

Studying privacy trends across the app ecosystem meant collecting structured metadata from millions of Google Play listings reliably and at scale, despite inconsistent page structures, missing fields, and a mix of static and dynamic HTML containers.

Crawling pipeline

Turning app-store pages into research-ready privacy data

[Diagram: queue → app page → parser → dataset. Queued package IDs such as com.health.fit, com.game.run, com.wallet.pay, and com.chat.app are fetched as app pages; the parser extracts fields from each page's Data privacy section (Location, App activity, Personal info, Financial info, Data encrypted, Data can be deleted); each app becomes one schema row in a research-ready dataset of 3M+ apps.]

Python · GCP · privacy labels · HTML parsing · Cloud workers

From page text to schema

The useful output was not a page scrape. It was normalized metadata.

The crawler had to preserve the semantics researchers cared about: what data an app collects, whether it is encrypted, whether deletion is supported, and how those disclosures vary by app category.

App identity: name, package, category

Collected data: location, activity, personal info

Security claims: encrypted, deletion, policy flags

Research features: clean rows for trend analysis
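As a rough illustration of what one normalized row might look like, the field groups above could be captured in a small dataclass. This is a sketch only: the class name, field names, and the example category value are assumptions made for illustration, not the lab's actual schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class PrivacyLabelRow:
    """One hypothetical row of normalized privacy-label metadata."""

    # App identity
    package: str
    name: Optional[str] = None
    category: Optional[str] = None
    # Collected data types, e.g. "Location", "App activity"
    collected: List[str] = field(default_factory=list)
    # Security claims; None means the listing did not state the claim
    data_encrypted: Optional[bool] = None
    data_can_be_deleted: Optional[bool] = None


# Example row with made-up values for illustration.
row = PrivacyLabelRow(
    package="com.health.fit",
    category="Health & Fitness",
    collected=["Location", "App activity"],
    data_encrypted=True,
)
record = asdict(row)  # plain dict, ready to write out as a table row
```

Keeping unstated claims as `None` rather than `False` preserves the difference between "the listing says no" and "the listing says nothing", which matters when pages have missing fields.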

Approach

I built a Python / GCP web-crawling pipeline to extract and process metadata from 3M+ Google Play apps. The crawler started from app IDs and listing URLs, fetched the corresponding pages, isolated the privacy-label section, and converted page text into structured fields suitable for research.
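The first step, going from an app ID to a listing URL and back, might look roughly like the sketch below. It assumes the public Play Store details URL with an `id` query parameter and an `hl` language parameter; the helper names are hypothetical and this is not the lab's code.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Public Play Store listing endpoint; the app ID goes in the `id` parameter.
PLAY_DETAILS = "https://play.google.com/store/apps/details"


def listing_url(package: str, lang: str = "en") -> str:
    """Build the listing URL for a package ID (hypothetical helper)."""
    return f"{PLAY_DETAILS}?{urlencode({'id': package, 'hl': lang})}"


def package_from_url(url: str) -> Optional[str]:
    """Recover the package ID from a listing URL, or None if it is absent."""
    ids = parse_qs(urlparse(url).query).get("id")
    return ids[0] if ids else None


url = listing_url("com.health.fit")
```

Pinning the language parameter is one way to keep label text consistent across pages, so that later string matching does not depend on the crawler's locale.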

Some fields were straightforward to parse with libraries like Beautiful Soup. Others were buried in more complex containers, so the crawler had to search HTML strings for stable patterns and normalize the extracted values into a consistent schema.
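The two parsing styles described above can be sketched as follows. To stay dependency-free, this example uses the standard library's `html.parser` in place of Beautiful Soup for the structured fields, and a regular expression over the raw HTML string for values buried in awkward containers. The sample markup, the choice of `<h3>` headings, and the matched phrases are all assumptions for illustration, not the real page structure.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h3> element (assumed to name a data type)."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3:
            self.headings[-1] += data


# Assumed phrases; whitespace-tolerant so minor markup changes still match.
ENCRYPTED_RE = re.compile(r"data\s+is\s+encrypted\s+in\s+transit", re.IGNORECASE)
DELETION_RE = re.compile(r"request\s+that\s+data\s+be\s+deleted", re.IGNORECASE)


def parse_label(html: str) -> dict:
    """Extract collected data types and security claims from one page."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return {
        "collected": [h.strip() for h in collector.headings if h.strip()],
        # True if the phrase appears anywhere in the raw HTML, else None (not stated)
        "data_encrypted": True if ENCRYPTED_RE.search(html) else None,
        "data_can_be_deleted": True if DELETION_RE.search(html) else None,
    }


# Invented sample: one claim lives inside an attribute, out of reach of tag walking.
SAMPLE = """
<section><h3>Location</h3><h3>App activity</h3>
<div data-x='{"t":"Data is encrypted in transit"}'></div></section>
"""
parsed = parse_label(SAMPLE)
```

Searching the raw string trades precision for robustness: it finds a claim wherever it is embedded, but it depends on the phrase itself staying stable.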

The pipeline was designed to scale gradually: validate the parser on a small batch, expand to larger crawls on Google Cloud, store outputs safely, and produce a clean dataset researchers could query for trend analysis.
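One way to picture the gradual scale-up is a sequence of growing crawl sizes, with a coverage check on each stage before moving to the next. The sketch below is a minimal version of that idea; the starting size, growth factor, and required-field check are assumptions, and none of it reflects the actual GCP setup.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def coverage(rows, required):
    """Fraction of rows in which every required field is present (not None)."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(k) is not None for k in required) for r in rows)
    return ok / len(rows)


def staged_sizes(start, total, factor=10):
    """Crawl sizes that grow by `factor` until the full total is reached."""
    size = start
    while size < total:
        yield size
        size *= factor
    yield total


# Hypothetical plan: start with 100 apps and grow tenfold up to 3,000,000.
plan = list(staged_sizes(100, 3_000_000))
```

In use, a stage would only be promoted to the next size if `coverage` on its parsed rows stayed above some agreed threshold, so parser regressions surface on small batches rather than on the full crawl.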

Impact

The pipeline turned a sprawling, messy source into research-ready data, enabling analysis of privacy practices across the Play Store at a scale that manual collection could never reach.

The work gave me first-hand experience with cloud experiments, web backends, dynamic page structure, and the practical problem of turning messy internet-scale data into something researchers can trust. The code remains proprietary to Carnegie Mellon.