Spark Project: Exploring and Forecasting Urban Pollution

"First, I normalized all text columns (City, Location, Pollutant) to remove inconsistencies such as uppercase/lowercase differences, unwanted characters, and extra spaces. Then I cleaned the data by converting blank strings into nulls, removing missing rows, and dropping duplicates to ensure reliable input. After that, I generated time-based features like Hour, Day, Month, and Year from the timestamp, which help capture temporal patterns in pollution levels. I also standardized column types and kept only cities with enough data points to improve model stability."
"Next, I handled cases where pollutants appear in comma-separated lists by splitting them using flatMap. I added several engineered features such as average, max, min values per pollutant-station, global index normalization, weekend indicator, season category, timestamps, rush-hour flag, rolling statistics (3-hour window), z-score normalization per pollutant, and pollutant count per station. By doing all these steps, the dataset becomes clean, consistent, enriched, and statistically meaningful - allowing my predictive models to learn accurate patterns for urban pollution forecasting."
A complete data-cleaning and feature-engineering pipeline prepares raw pollution data for machine learning. Text columns such as City, Location, and Pollutant are normalized to remove case differences, unwanted characters, and extra spaces. Blank strings are converted to nulls, missing rows are removed, and duplicates are dropped to ensure reliable input. Time-based features including Hour, Day, Month, and Year are generated from timestamps to capture temporal patterns. Column types are standardized and cities with insufficient data are excluded to improve model stability. Comma-separated pollutant lists are split with flatMap. Engineered features include per-station average, max, min, reading counts, global index normalization, weekend and season indicators, rush-hour flags, 3-hour rolling statistics, z-score normalization per pollutant, and pollutant counts per station, producing a statistically meaningful dataset for forecasting.
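The window-based features in this summary (per-station aggregates, 3-hour rolling statistics, per-pollutant z-scores) can be expressed with Spark window functions. The sketch below continues from the DataFrame above; reading "global index normalization" as dataset-wide min-max scaling is an assumption on my part.

```python
from pyspark.sql import Window, functions as F

# Per pollutant-station aggregates, plus the pollutant count per station.
station = Window.partitionBy("Location", "Pollutant")
df = (
    df.withColumn("StationAvg", F.avg("Value").over(station))
      .withColumn("StationMax", F.max("Value").over(station))
      .withColumn("StationMin", F.min("Value").over(station))
      .withColumn(
          "PollutantsPerStation",
          F.size(F.collect_set("Pollutant").over(Window.partitionBy("Location"))),
      )
)

# 3-hour rolling statistics via a time-range window keyed on epoch seconds.
rolling = (
    Window.partitionBy("Location", "Pollutant")
          .orderBy(F.col("Timestamp").cast("long"))
          .rangeBetween(-3 * 3600, 0)
)
df = (
    df.withColumn("Rolling3hAvg", F.avg("Value").over(rolling))
      .withColumn("Rolling3hMax", F.max("Value").over(rolling))
)

# Z-score normalization per pollutant, so readings in different units
# become comparable across pollutant types.
by_pollutant = Window.partitionBy("Pollutant")
df = df.withColumn(
    "ValueZ",
    (F.col("Value") - F.avg("Value").over(by_pollutant))
    / F.stddev("Value").over(by_pollutant),
)

# Min-max scale the raw value over the whole dataset (one possible reading
# of "global index normalization" -- an assumption).
lo, hi = df.agg(F.min("Value"), F.max("Value")).first()
df = df.withColumn("ValueNorm", (F.col("Value") - F.lit(lo)) / F.lit(hi - lo))
```

Window functions compute each of these features in a single pass per partition, which avoids joining the readings against separately aggregated tables and keeps the whole feature set in one DataFrame.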