Oak Park Crime Reporting

Data Science
Data Analytics
Data Engineering
Author

Jesse Anderson

Published

January 9, 2025

Oak Park Crime Reporting Documentation

Table of Contents

  1. Motivation

  2. How to Use the Tool

  3. Documentation Introduction

  4. Data Parsing

  5. Live Streamlit Dashboard

  6. Static HTML Generation

  7. Conclusion

  8. Future Enhancements

Motivation

Tracking and analyzing crime data is crucial for fostering safer communities and enabling informed decision-making. This tool is not designed for, and should not be used in, any formal or informal decision-making, per the disclaimers presented within the application. Still, it is my hope that it triggers discussion about using analytics to enable data-driven decision-making, in line with my other personal and professional projects. My journey to develop a comprehensive crime tracking tool for Oak Park was driven by the challenges and inefficiencies I encountered while navigating the Oak Park Police Department’s (OPPD) publicly available resources.

Initial Challenges

Navigating the OPPD website presented a significant hurdle. The process involved manually accessing PDF files for specific dates, meticulously copying and pasting individual crime reports, and then attempting to locate each incident on a map. This method was not only time-consuming but also prone to errors, making it difficult to:

  • Quickly Assess Crime Trends: Without an aggregated view, understanding the frequency and distribution of crimes over a given period was arduous.

  • Identify Crime Hotspots: Pinpointing areas with high concentrations of criminal activity lacked precision and was labor-intensive.

  • Monitor Crime Progression: Determining whether crime rates were escalating or declining within Oak Park required extensive manual effort.

Embarking on a Solution

Determined to streamline this process, I embarked on automating data extraction and visualization. The primary objective was to transform unstructured PDF data into a structured format that could be easily analyzed and visualized.

Overcoming Technical Hurdles

The transition from PDFs to actionable data was fraught with challenges:

  1. Data Extraction Complexity:

    • Initial Approach: I leveraged Regular Expressions (Regex) to parse and extract relevant crime data from the PDFs.

    • Error Rates: The initial extraction efforts yielded error rates between 20-30%, compromising data reliability.

  2. Optimizing Data Accuracy:

    • Refining Regex Patterns: Through iterative testing and refinement of Regex patterns, I significantly reduced error rates to 10% and eventually to 5%.

    • Additional Optimizations: Implementing data validation checks and leveraging Python’s robust data processing libraries further enhanced extraction accuracy.

Creating Insightful Visualizations

With accurate data in hand, the next step was to visualize it in a meaningful way:

  • Interactive Maps: Utilizing Folium, I created dynamic maps that plot each crime incident, providing a spatial understanding of criminal activity.

  • Identifying Hotspots: By incorporating Marker Clustering, the maps highlight areas with high crime densities, enabling quick identification of hotspots.

  • Filtering Capabilities: Implementing filtering options allows users to view specific types of crimes or incidents within selected time frames, enhancing the map’s utility.

Expanding Analytical Capabilities

The success of the initial visualization opened avenues for further enhancements:

  • Natural Language Processing (NLP): I recognized the potential to analyze crime narratives using NLP techniques to uncover commonalities and emerging patterns across different incidents.

  • Trend Analysis: Incorporating statistical analyses and trend graphs could provide deeper insights into the progression of crime rates over time.

Note: While NLP and advanced trend analyses offer substantial benefits, they are beyond the scope of this documentation and are earmarked for future development.

Personal Satisfaction and Practical Usage

The culmination of these efforts is a robust, user-friendly tool that not only eliminates tedious manual processes but also gives users actionable insights into Oak Park’s crime landscape. The tool helps to:

  • Streamline Data Processing: Automates the extraction and visualization of crime data, saving valuable time.

  • Enhance Decision-Making: Provides clear visual representations of crime trends and hotspots, facilitating informed community and law enforcement decisions.

  • Promote Community Safety: By making crime data more accessible and understandable, it fosters a proactive approach to community safety initiatives.

I am immensely satisfied with the outcome of this project. The tool has become an integral part of my routine, allowing me to stay informed about the evolving crime dynamics in Oak Park efficiently and effectively.


How to Use the Tool

The Oak Park Crime Reporting tool is designed to provide comprehensive insights into crime data within Oak Park through interactive dashboards, static maps, and automated email updates. This section guides you through the various functionalities and how to effectively utilize them.

Using the Live Streamlit Dashboard

The Live Streamlit Dashboard offers an interactive platform to explore and visualize crime data in real-time. Here’s how to make the most of it:

  1. Accessing the Dashboard:

    • Local Deployment: If you’re running the dashboard locally, navigate to the project directory in your terminal and execute:

      streamlit run streamlit_app.py

      This command will launch the dashboard in your default web browser.

    • Hosted Deployment: The tool currently runs on a free Streamlit deployment (I don’t particularly want to pay for an AWS instance without funds coming in). Access it via the provided URL (e.g., https://op-crime.streamlit.app/).

  2. Navigating the Dashboard:

    • Disclaimer Agreement: Upon first access, you’ll be presented with a legal disclaimer. Read through the terms and check the agreement box to proceed.

    • Interactive Filters:

      • Date Range: Select the desired date range using the date pickers to filter crime data accordingly.

      • Offense Type: Use the multiselect dropdown to filter crimes based on specific offense categories.

    • Visualizing Data:

      • Map Display: The right panel displays a dynamic map highlighting crime incidents based on your filters. Click on markers to view detailed information about each incident.

      • Data Table: Optionally, a table listing all filtered crime records may be available for reference.

  3. Additional Features:

    • Navigation Links: Access related resources such as the author’s portfolio, blog, and documentation through the top navigation links.

    • Email Subscription: You can subscribe to or unsubscribe from weekly email updates with the latest crime reports. Currently, the script runs automatically every day to capture new incidents, and a report is emailed to me and a few other people once a week.

Accessing the Static HTML Maps

For users who prefer or require static reports, the tool generates HTML maps that can be accessed independently of the Streamlit dashboard. Here’s how to access and utilize them:

  1. Generated Maps:

    • Weekly Crime Map: Provides a snapshot of crime incidents from the past week.

    • Cumulative Crime Map: Displays all recorded crime incidents to date.

  2. Access Methods:

    • GitHub Pages: The static maps are hosted on GitHub Pages for easy access. Navigate to the respective URLs:

      • Weekly Map: https://jesse-anderson.github.io/OP-Crime-Maps/crime_map_weekly_YYYY-MM-DD.html

      • Cumulative Map: https://jesse-anderson.github.io/OP-Crime-Maps/crime_map_cumulative.html

    • Direct Access: If hosting elsewhere, ensure the HTML files are uploaded to a web-accessible directory and navigate to their URLs.

  3. Interacting with the Maps:

    • Zoom and Pan: Use your browser’s native controls to zoom in/out and pan across the map.

    • Marker Details: Click on individual markers to view detailed information about each crime incident, including links to original PDF reports.

  4. Embedding Maps:

    • Web Integration: Embed the HTML maps into other websites or internal dashboards using <iframe> tags or direct links for seamless integration.
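As a small illustration of the weekly map URL pattern listed above, the sketch below fills in the YYYY-MM-DD placeholder from a Python date. The helper names weekly_map_url and cumulative_map_url are hypothetical, the date used is only an example, and which date a given weekly file is stamped with (for instance, the day the map was generated) is an assumption to check against the published files.

```python
from datetime import date

# Base URL taken from the GitHub Pages links listed above.
BASE_URL = "https://jesse-anderson.github.io/OP-Crime-Maps"


def weekly_map_url(report_date: date) -> str:
    """Hypothetical helper: fill the YYYY-MM-DD placeholder in the weekly map URL."""
    return f"{BASE_URL}/crime_map_weekly_{report_date.isoformat()}.html"


def cumulative_map_url() -> str:
    """The cumulative map has a fixed file name, so no date is needed."""
    return f"{BASE_URL}/crime_map_cumulative.html"


# Example date only; it does not imply that a map exists for this day.
print(weekly_map_url(date(2025, 1, 5)))
print(cumulative_map_url())
```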

Signing Up for Weekly Email Updates

Stay informed with the latest crime reports by subscribing to our weekly email updates. Follow these steps to sign up:

  1. Access the Subscription Form:

    • Streamlit Dashboard: Navigate to the “📧 Email Updates” section within the dashboard.

    • Direct Link: Alternatively, use the provided Google Forms link: Add me to Weekly Updates

  2. Submitting Your Email:

    • Subscription:

      • Form Fill: Enter your valid email address in the subscription form.

      • Confirmation: Upon successful submission, you’ll receive a confirmation email verifying your subscription.

    • Unsubscription:

      • Form Fill: To unsubscribe, provide your email address in the unsubscription section of the form.

      • Confirmation: A confirmation email will notify you of the successful removal from the mailing list.

  3. Managing Preferences:

    • Frequency: Currently, the tool sends out weekly updates. Future enhancements may include customizable frequencies, such as daily updates.

    • Content: Receive links to the latest weekly and cumulative maps, along with attached CSV reports detailing the most recent crime data.

  4. Privacy Assurance:

    • Data Handling: Your email address is securely handled and only used for sending crime report updates.

    • Opt-Out Anytime: You can unsubscribe at any time without any hassle through the provided unsubscription process.


Documentation Introduction

The Oak Park Crime Reporting project is designed to streamline the process of tracking, analyzing, and visualizing crime data within Oak Park. The project automates data extraction from the Oak Park Police Department’s (OPPD) reports, processes and cleans the data, visualizes it on interactive maps, and disseminates the information through various channels such as a Streamlit dashboard and static HTML pages. This documentation provides an in-depth look into the project’s components, focusing initially on data parsing.

Data Parsing

Overview

Data parsing is the foundational step in the Oak Park Crime Reporting pipeline. It involves extracting relevant information from the OPPD’s PDF reports, cleaning and structuring the data, and preparing it for visualization and analysis. The primary scripts responsible for this phase are:

  1. OakPark_Crime_Reporting_Web.py

  2. utils.py

These scripts work in tandem to automate the cumbersome manual process of navigating through PDF files, extracting crime details, and geocoding locations for mapping.

Key Components

1. OakPark_Crime_Reporting_Web.py

Purpose: This script orchestrates the data parsing workflow. It handles downloading PDF reports from the OPPD website, extracting crime data, geocoding locations, managing caches to optimize performance, and committing the processed data to a GitHub repository.

Imports and Dependencies:
import os
import re
import json
import pandas as pd
from pathlib import Path
import logging
from collections import defaultdict
import time
from datetime import datetime
import string
import googlemaps
import zipfile

from utils import (
    load_env_vars,
    normalize_location,
    load_json_cache,
    save_json_cache,
    fetch_pdf_links,
    download_pdf,
    extract_data_from_pdf,
    clean_narrative_basic,
    process_narrative_nlp,
    parse_date,
    clean_text,
    get_lat_long,
    get_api_call_count,
    extract_year,
    git_commit_and_push
)

  • Standard Libraries: os, re, json, pathlib, logging, collections, time, datetime, string, zipfile

  • Third-Party Libraries: pandas, googlemaps

  • Local Utilities: Functions imported from utils.py

Main Functionality: The main() function serves as the entry point, executing the following steps:

  1. Environment Setup:

    • Determines the script’s directory.

    • Loads environment variables from env_vars.txt.

    • Initializes the Google Maps client using the API key.

  2. Directory and Path Configuration:

    • Sets up directories for downloading PDFs, storing data, and caching.

    • Defines paths for output CSV and ZIP files.

  3. Cache Management:

    • Loads existing caches to avoid redundant processing.

    • Tracks already processed complaint numbers to prevent duplicates.

  4. PDF Processing Loop:

    • Fetches PDF links from the OPPD website.

    • Downloads each PDF unless it’s already processed and unchanged.

    • Extracts crime data from PDFs, handling errors and logging.

  5. Data Aggregation and Storage:

    • Combines new data with existing data, removes duplicates, and sorts by date.

    • Compresses the aggregated data into a ZIP file.

  6. Logging and Reporting:

    • Records processing statistics and errors.

    • Saves logs to a dated log file.

    • Calculates and records the error rate in error_rate.txt for the current run.

  7. GitHub Integration:

    • Commits and pushes the updated data to a GitHub repository.
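Steps 3 and 5 above (tracking known complaint numbers, then merging, de-duplicating, and sorting) can be sketched with plain Python structures. This is a simplified illustration rather than the script's actual implementation, which imports pandas and presumably works on DataFrames; the record keys ("Complaint #", "Date"), the sample records, and the function name merge_records are assumptions.

```python
from datetime import date


def merge_records(existing, new):
    """Merge two lists of record dicts, keeping the first record seen per complaint number.

    The merged records are returned sorted by their "Date" value, oldest first.
    """
    seen = set()
    merged = []
    for record in existing + new:
        complaint = record["Complaint #"]
        if complaint in seen:
            continue  # duplicate complaint number: keep the earlier copy
        seen.add(complaint)
        merged.append(record)
    return sorted(merged, key=lambda r: r["Date"])


# Invented records for illustration.
existing = [{"Complaint #": "25-00002", "Date": date(2025, 1, 3)}]
new = [
    {"Complaint #": "25-00001", "Date": date(2025, 1, 2)},
    {"Complaint #": "25-00002", "Date": date(2025, 1, 3)},  # already processed
]

combined = merge_records(existing, new)
print([r["Complaint #"] for r in combined])
```

Keying on the complaint number means a re-downloaded PDF cannot add the same incident twice, which is the same goal the existing_complaint_numbers set serves in the extraction step.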

Sample Code Snippet:

def main():
    start_time = time.time()
    start_time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Determine the directory where this script resides
    script_dir = Path(__file__).parent.resolve()

    # Path to the environment variables file relative to script directory
    env_file_path = script_dir / "env_vars.txt"

    # Load environment variables
    try:
        load_env_vars(env_file_path)
    except FileNotFoundError as e:
        print(e)
        return
    
    googlemaps_api_key = os.getenv("GOOGLEMAPS_API_KEY")
    if not googlemaps_api_key:
        raise ValueError("Google Maps API key not found in environment variables.")

    # Initialize Google Maps client
    try:
        gmaps_client = googlemaps.Client(key=googlemaps_api_key)
    except Exception as e:
        logging.critical(f"Failed to initialize Google Maps client: {e}")
        print(f"Failed to initialize Google Maps client: {e}")
        return

    # Define paths relative to script directory
    base_url = 'https://www.oak-park.us/village-services/police-department/police-activity-summary-reports'
    download_dir = script_dir / 'downloaded_pdfs'
    data_dir = script_dir / 'data'
    cache_dir = script_dir / 'cache'
    data_dir.mkdir(parents=True, exist_ok=True)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # ...

Explanation:

  • Environment Variables: Critical for configuring API keys and repository paths without hardcoding sensitive information.

  • Google Maps Client: Essential for geocoding addresses to latitude and longitude coordinates.

  • Directory Setup: Organizes downloaded PDFs, processed data, and caches systematically.

2. utils.py

Purpose:
utils.py contains a collection of helper functions that support the main data parsing operations. These functions handle tasks such as loading environment variables, normalizing location strings, caching mechanisms, PDF link fetching, PDF downloading, text cleaning, date parsing, geocoding, and GitHub file uploads.

Key Functions:

  1. Environment Management:

    • load_env_vars(file_path): Loads environment variables from a specified file.

  2. Data Normalization and Cleaning:

    • normalize_location(loc_str): Standardizes location strings for consistency.

    • clean_text(text): Cleans raw text extracted from PDFs.

    • parse_date(date_str): Parses various date formats into a standardized YYYY-MM-DD format.

  3. Caching Mechanisms:

    • load_json_cache(cache_path): Loads cache data from a JSON file.

    • save_json_cache(cache_path, data): Saves cache data to a JSON file.

  4. PDF Handling:

    • fetch_pdf_links(base_url): Scrapes the OPPD website to retrieve PDF report links.

    • download_pdf(url, download_dir, redownload=False): Downloads PDFs from given URLs.

  5. Data Extraction and Processing:

    • extract_data_from_pdf(file_path, gmaps_client, location_cache, reprocess_locs, existing_complaint_numbers): Extracts structured crime data from PDFs.

    • clean_narrative_basic(narrative): Performs basic cleaning on narrative text.

    • process_narrative_nlp(narrative): Applies NLP techniques to process narrative text.

  6. Geocoding:

    • get_lat_long(location_string, gmaps_client): Converts location strings to geographical coordinates using the Google Maps API.

    • get_api_call_count(): Retrieves the count of API calls made (useful for monitoring rate limits).

  7. GitHub Integration:

    • upload_file_to_github(file_path, github_repo_path, target_subfolder): Uploads a single file to a specified GitHub repository subfolder.

    • upload_files_to_github_batch(file_paths, github_repo_path, target_subfolder): Facilitates batch uploading of multiple files.

    • git_commit_and_push(repo_path, commit_message): Commits and pushes changes to the GitHub repository, handling authentication and potential conflicts.
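parse_date is described above as turning several date formats into YYYY-MM-DD, but its body is not shown in this documentation. One way such a function might work is to try a list of candidate formats with datetime.strptime. The format list and the name parse_date_sketch below are assumptions for illustration; the real function may accept different inputs and handle failures differently.

```python
from datetime import datetime

# Candidate input formats (assumed; the actual reports may use others).
CANDIDATE_FORMATS = ("%d-%b-%y", "%d-%b-%Y", "%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d")


def parse_date_sketch(date_str):
    """Return the date as 'YYYY-MM-DD', or None if no candidate format matches."""
    cleaned = date_str.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not fit; try the next one
    return None


print(parse_date_sketch("09-JAN-25"))
print(parse_date_sketch("01/09/2025"))
print(parse_date_sketch("not a date"))
```

Returning None for unparseable input is one possible design; a sentinel date would be another, and the dashboard code later in this document filters out rows dated 1900-01-01.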

Sample Code Snippet:

def extract_data_from_pdf(file_path, gmaps_client, location_cache, reprocess_locs, existing_complaint_numbers):
    """
    Extract data from PDF, returning (report, log_entries).
    Only processes complaints not already in existing_complaint_numbers.

    Args:
        file_path (str or Path): Path to the PDF file.
        gmaps_client (googlemaps.Client): Initialized Google Maps client.
        location_cache (dict): Cache of normalized locations to (lat, lng).
        reprocess_locs (bool): Flag to force reprocessing of locations.
        existing_complaint_numbers (set): Set of complaint numbers already processed.

    Returns:
        tuple: (list of report entries, list of log entries)
    """
    try:
        reader = PdfReader(file_path)
        raw_text = " ".join([page.extract_text() or "" for page in reader.pages])
    except Exception as e:
        logging.error(f"Failed to read PDF '{file_path}': {e}")
        return [], [f"Failed to read PDF '{file_path}': {e}"]
    text = clean_text(raw_text)
    base_url_static = 'https://www.oak-park.us/sites/default/files/police/summaries/'
    # Log a preview of the cleaned text
    logging.debug(f"Cleaned Text Preview (first 500 chars): {text[:500]}...")
    
    complaint_pattern = r"COMPLAINT NUMBER:\s*(\d{2}-\d{5})"
    offense_pattern   = r"OFFENSE:\s+([A-Z\s]+)"
    date_pattern      = r"DATE\(S\)\s*:?\s+([A-Za-z0-9\s&\-–—/]+?)(?=\s+TIME\(S\)|\s+$)"
    time_pattern      = r"TIME\(S\):\s+([\d:HRS\s\-–—]+)"
    location_pattern  = r"LOCATION:\s+(.+?)(?=\s+(?:VICTIM/ADDRESS|NARRATIVE|NARRITIVE|NARRTIVE))"
    victim_pattern    = r"VICTIM/ADDRESS:\s+(.+?)(?=\s+(?:NARRATIVE|NARRITIVE|NARRTIVE))"
    narrative_pattern = r"NARR(?:ATIVE|ITIVE|TIVE)\s*:\s+(.+?)(?=COMPLAINT NUMBER|$)"

    complaints = re.findall(complaint_pattern, text)
    offenses   = re.findall(offense_pattern, text)
    dates      = re.findall(date_pattern, text)
    times      = re.findall(time_pattern, text)
    locations  = re.findall(location_pattern, text)
    victims    = re.findall(victim_pattern, text)
    narratives = re.findall(narrative_pattern, text)
    
    # Clean offenses
    offenses = [o.replace("DATE", "").strip() for o in offenses]
    
    report = []
    log_entries = []
    time.sleep(0.2)  # Respectful pause for API calls
    
    num_entries = len(complaints)
    logging.debug(f"Number of complaints found: {num_entries}")
    
    for i in range(num_entries):
        try:
            # Safe indexing
            comp_num = complaints[i] if i < len(complaints) else "N/A"
            
            # Skip already processed complaints
            if comp_num in existing_complaint_numbers:
                logging.info(f"Skipping already processed Complaint # {comp_num}")
                continue

            # Extract other fields
            offense  = offenses[i].strip() if i < len(offenses) else "N/A"
            time_str = times[i].strip() if i < len(times) else "N/A"
            loc_str  = locations[i].strip() if i < len(locations) else "N/A"
            victim   = victims[i].strip() if i < len(victims) else "N/A"
            narr_raw = narratives[i].strip() if i < len(narratives) else "N/A"

            # ...

Explanation:

  • PDF Reading: Utilizes PyPDF2.PdfReader to extract text from each page of the PDF.

  • Regex Patterns: Defined to capture specific sections like Complaint Number, Offense, Date, Time, Location, Victim Address, and Narrative. Notably, the patterns tolerate several date formats and dash styles, as well as misspellings of the NARRATIVE label (NARRITIVE, NARRTIVE).

  • Data Extraction: Uses re.findall to extract relevant data based on the defined patterns.

  • Data Cleaning: Processes extracted data to ensure consistency and accuracy.

  • Duplicate Handling: Skips complaints that have already been processed to prevent duplication.
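To see how the positional re.findall approach lines fields up, the sketch below runs three of the patterns from extract_data_from_pdf against a made-up text fragment. The sample text is invented for illustration and is not taken from an OPPD report; real reports contain messier spacing and more variation.

```python
import re

# Patterns copied from the extract_data_from_pdf snippet above.
complaint_pattern = r"COMPLAINT NUMBER:\s*(\d{2}-\d{5})"
location_pattern = r"LOCATION:\s+(.+?)(?=\s+(?:VICTIM/ADDRESS|NARRATIVE|NARRITIVE|NARRTIVE))"
narrative_pattern = r"NARR(?:ATIVE|ITIVE|TIVE)\s*:\s+(.+?)(?=COMPLAINT NUMBER|$)"

# Invented sample text; note the misspelled "NARRITIVE" in the second entry.
sample = (
    "COMPLAINT NUMBER: 25-00001 OFFENSE: THEFT DATE(S): 02-JAN-25 TIME(S): 1200 HRS "
    "LOCATION: 100 BLOCK OF EXAMPLE ST VICTIM/ADDRESS: RESIDENT "
    "NARRATIVE: A package was taken from a porch. "
    "COMPLAINT NUMBER: 25-00002 OFFENSE: BURGLARY DATE(S): 03-JAN-25 TIME(S): 0900 HRS "
    "LOCATION: 200 BLOCK OF SAMPLE AVE VICTIM/ADDRESS: BUSINESS "
    "NARRITIVE: A window was broken overnight."
)

complaints = re.findall(complaint_pattern, sample)
locations = re.findall(location_pattern, sample)
narratives = [n.strip() for n in re.findall(narrative_pattern, sample)]

# Fields are paired by position, so a missed match in one list shifts later entries.
for comp, loc, narr in zip(complaints, locations, narratives):
    print(comp, "|", loc, "|", narr)
```

Because each field is collected into its own list and then matched up by index, a single unmatched field can misalign every later record in the same PDF, which is one plausible source of the parsing errors discussed in the Motivation section.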

Detailed Code Breakdown

Let’s delve deeper into some of the critical functions within utils.py to understand their roles and implementations.

a. Environment Variables Loading
def load_env_vars(file_path):
    """
    Load environment variables from a file and set them in os.environ.

    Args:
        file_path (str or Path): The path to the environment variables file.

    Raises:
        FileNotFoundError: If the specified file does not exist.
    """
    env_file = Path(file_path)
    if not env_file.exists():
        raise FileNotFoundError(f"Environment file '{file_path}' not found.")
    
    with env_file.open('r') as f:
        for line in f:
            # Remove leading/trailing whitespace
            line = line.strip()
            # Skip empty lines and comments
            if not line or line.startswith('#'):
                continue
            # Split into key and value
            if '=' in line:
                key, value = line.split('=', 1)
                key = key.strip()
                value = value.strip()
                os.environ[key] = value
                print("Loaded environment variable")
            else:
                print("Ignoring invalid line in env file")

Functionality:

  • Purpose: Reads a file containing environment variables and sets them in the os.environ dictionary for use throughout the application.

  • Error Handling: Raises a FileNotFoundError if the specified environment file does not exist.

  • Parsing Logic:

    • Ignores empty lines and lines starting with # (comments).

    • Splits each valid line into a key-value pair based on the = delimiter.

    • Trims whitespace and sets the environment variable.

Usage: This function ensures that sensitive information like API keys and repository paths are not hardcoded into the scripts but are instead loaded securely from an external file.
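For reference, the parsing rules above imply an env_vars.txt made of KEY=VALUE lines, with blank lines and # comments ignored. The sketch below applies the same rules to an in-memory string and returns a dict instead of touching os.environ. The function name parse_env_text, the EXAMPLE_TOKEN variable, and all values are hypothetical; GOOGLEMAPS_API_KEY is the only variable name confirmed by the code shown in this documentation.

```python
def parse_env_text(text):
    """Apply the load_env_vars parsing rules to a string and return a dict."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" not in line:
            continue  # ignore lines without a key/value delimiter
        key, value = line.split("=", 1)  # split on the first '=' only
        parsed[key.strip()] = value.strip()
    return parsed


sample_env = """
# Placeholder values for illustration only
GOOGLEMAPS_API_KEY = not-a-real-key
EXAMPLE_TOKEN=abc=def
this line has no delimiter
"""

env = parse_env_text(sample_env)
print(env)
```

Splitting on the first = only means values that themselves contain = (as some tokens do) are kept intact.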

b. Location Normalization
def normalize_location(loc_str):
    """
    Normalize the location string to ensure consistency in caching.
    
    Steps:
    - Convert to lowercase.
    - Remove leading/trailing whitespace.
    - Remove punctuation.
    - Replace multiple spaces with a single space.
    - Standardize common street suffixes.
    
    Args:
        loc_str (str or float): The original location string.

    Returns:
        str: The normalized location string.
    """
    if not isinstance(loc_str, str):
        if pd.isna(loc_str):
            loc_str = ""
        else:
            loc_str = str(loc_str)
    
    if not loc_str:
        return ""
    
    # Convert to lowercase
    loc_str = loc_str.lower()
    # Remove leading/trailing whitespace
    loc_str = loc_str.strip()
    # Remove punctuation
    loc_str = loc_str.translate(str.maketrans('', '', string.punctuation))
    # Replace multiple spaces with a single space
    loc_str = re.sub(r'\s+', ' ', loc_str)
    # Standardize suffixes
    loc_str = standardize_suffix(loc_str)
    return loc_str

Functionality:

  • Purpose: Ensures that location strings are consistently formatted to improve caching efficiency and reduce redundancy.

  • Normalization Steps:

    1. Type Handling: Converts non-string inputs to strings, handling missing values gracefully.

    2. Case Conversion: Transforms the string to lowercase to ensure case-insensitive matching.

    3. Whitespace Trimming: Removes unnecessary leading and trailing spaces.

    4. Punctuation Removal: Strips out all punctuation to avoid discrepancies caused by different punctuation marks.

    5. Whitespace Reduction: Collapses multiple spaces into a single space for uniformity.

    6. Suffix Standardization: Converts common street suffixes (e.g., “st” to “street”) to a standardized form.
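standardize_suffix is called by normalize_location but is not shown in this documentation. The sketch below pairs a simplified version of the normalization steps with a hypothetical standardize_suffix that expands a trailing abbreviation using a small mapping. The mapping, the sample addresses, and the helper's behavior are assumptions; the real function may cover more suffixes or match them elsewhere in the string.

```python
import re
import string

# Assumed abbreviation map; the real list is not shown in the source.
SUFFIX_MAP = {"st": "street", "ave": "avenue", "blvd": "boulevard", "rd": "road"}


def standardize_suffix(loc_str):
    """Hypothetical helper: expand a trailing street-suffix abbreviation."""
    words = loc_str.split()
    if words and words[-1] in SUFFIX_MAP:
        words[-1] = SUFFIX_MAP[words[-1]]
    return " ".join(words)


def normalize_location_sketch(loc_str):
    """Lowercase, trim, drop punctuation, collapse spaces, then expand the suffix."""
    loc_str = loc_str.lower().strip()
    loc_str = loc_str.translate(str.maketrans("", "", string.punctuation))
    loc_str = re.sub(r"\s+", " ", loc_str)
    return standardize_suffix(loc_str)


# Two spellings of the same address collapse to one cache key.
first = normalize_location_sketch("  100 N. Example St. ")
second = normalize_location_sketch("100 n example   street")
print(first)
print(first == second)
```

The payoff is in the geocoding cache: when both spellings normalize to the same key, the second one can be served without another API call.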

Conclusion of Data Parsing Section

The Data Parsing component is meticulously crafted to automate the extraction and preparation of crime data from PDF reports. By leveraging robust libraries like PyPDF2, googlemaps, and BeautifulSoup, alongside custom utility functions, the scripts ensure data accuracy, consistency, and efficiency. Caching mechanisms play a pivotal role in optimizing performance by preventing redundant downloads and processing and by minimizing API calls. Additionally, seamless integration with GitHub facilitates version control and data dissemination.
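The caching idea described here can be sketched as a JSON file that maps normalized location strings to coordinates and is consulted before any geocoding call. This is a minimal illustration, not the project's code: load_cache, save_cache, and cached_lookup are hypothetical names, the coordinates are made up, and fake_geocode stands in for the Google Maps request so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_cache(cache_path):
    """Return the cached dict, or an empty dict if the file does not exist yet."""
    path = Path(cache_path)
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def save_cache(cache_path, data):
    """Write the cache dict to disk as JSON."""
    with Path(cache_path).open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


calls = []  # records each simulated API call


def fake_geocode(location):
    """Stand-in for a real geocoding request; returns made-up coordinates."""
    calls.append(location)
    return [41.0, -87.0]


def cached_lookup(location, cache):
    """Geocode only when the location is not already cached."""
    if location not in cache:
        cache[location] = fake_geocode(location)
    return cache[location]


cache_file = Path(tempfile.mkdtemp()) / "location_cache.json"
cache = load_cache(cache_file)
cached_lookup("100 n example street", cache)
cached_lookup("100 n example street", cache)  # served from the cache
save_cache(cache_file, cache)
print(len(calls), load_cache(cache_file))
```

Persisting the cache between runs is what keeps a daily job from re-geocoding addresses it has already resolved.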


Live Streamlit Dashboard

Overview

The Live Streamlit Dashboard serves as the interactive frontend of the Oak Park Crime Reporting project. It provides users with a dynamic interface to explore, filter, and visualize crime data within Oak Park. Leveraging Streamlit’s capabilities, the dashboard offers:

  • Interactive Maps: Visual representation of crime incidents using Folium.

  • Dynamic Filters: Date range and offense type filters to customize data views.

  • Email Subscription: Integration with Mailchimp for users to subscribe or unsubscribe from updates.

  • Navigation Links: Quick access to related resources like Portfolio, Blog, Documentation, and signing up for Email Updates.

The primary script responsible for this component is streamlit_app.py.

Key Components

1. streamlit_app.py

Purpose:
This script builds the interactive Streamlit dashboard, enabling users to filter crime data by date and offense type, visualize the data on a map, and manage email subscriptions for updates.

Imports and Dependencies:

import streamlit as st
import pandas as pd
import folium
from streamlit_folium import st_folium
from datetime import datetime, timedelta
import numpy as np
import re
import hashlib
import requests

# Define Mailchimp API details from secrets
MAILCHIMP_API_KEY = st.secrets["mailchimp"]["api_key"]
MAILCHIMP_AUDIENCE_ID = st.secrets["mailchimp"]["audience_id"]
MAILCHIMP_DATA_CENTER = st.secrets["mailchimp"]["data_center"]

# Mailchimp API endpoint
MAILCHIMP_API_URL = f"https://{MAILCHIMP_DATA_CENTER}.api.mailchimp.com/3.0"

  • Standard Libraries: datetime (datetime, timedelta), re, hashlib

  • Third-Party Libraries: streamlit, pandas, folium, streamlit_folium, numpy, requests

  • Mailchimp Integration: Accesses Mailchimp API credentials securely via st.secrets

Main Functionalities:

  1. Data Loading: Reads and caches the processed crime data from a ZIP file.

  2. Disclaimer Enforcement: Presents a legal disclaimer that users must agree to before accessing the dashboard.

  3. Email Subscription Management: Allows users to subscribe or unsubscribe from email updates via Mailchimp.

  4. Navigation Links: Provides quick access to Portfolio, Blog, Documentation, Contact, Email Updates, and a Cumulative Map.

  5. Interactive Filters: Users can filter crime data by date range and offense type.

  6. Data Visualization: Displays filtered crime incidents on an interactive Folium map with detailed popups.

Sample Code Snippets and Explanations:

a. Safe Field Handling
def safe_field(value):
    """
    Return the string version of a field or 'Not found' if it's missing/NaN/empty.
    """
    if pd.isnull(value) or value == "":
        return "Not found"
    return str(value)

Explanation:

  • Purpose: Ensures that all fields displayed in the dashboard are present and readable. If a field is missing (NaN) or empty, it returns “Not found” to maintain consistency in the UI.

b. Data Loading with Caching
@st.cache_data
def load_data():
    """
    Reads 'summary_report.zip' once, caching the DataFrame in memory.
    This prevents re-reading the file on every app rerun.
    """
    df = pd.read_csv("data/summary_report.zip", compression="zip", encoding="cp1252")
    return df

Explanation:

  • Functionality: Loads the crime data from a compressed ZIP file and caches it using Streamlit’s @st.cache_data decorator.

  • Benefits:

    • Performance: Reduces load times by preventing redundant reads.

    • Efficiency: Ensures that the dashboard remains responsive, especially with large datasets.

c. Disclaimer Gate
def show_disclaimer():
    """
    Show a disclaimer 'gate' that the user must agree to in order to proceed.
    """
    st.markdown(
        """
        # Important Legal Disclaimer
        
        **By using this demonstrative research tool, you acknowledge and agree**:
        
        - This tool is for **demonstration purposes only**.
        - The data originated from publicly available Oak Park Police Department PDF files.
          View the official site here: [Oak Park Police Department](https://www.oak-park.us/village-services/police-department).
        - During parsing, **~10%** of complaints were **omitted** due to parsing issues; 
          thus the data is **incomplete**.
        - The **official** and **complete** PDF files remain with the Oak Park Police Department.
        - You **will not hold** the author **liable** for **any** decisions—formal or informal—based on this tool.
        - This tool **should not** be used in **any** official or unofficial **decision-making**.
        
        By continuing, you indicate your acceptance of these terms and disclaim all liability. 
        """
    )
    agree = st.checkbox("I have read the disclaimer and I agree to continue.")
    if agree:
        st.session_state["user_agreed"] = True
        st.rerun()
    else:
        st.stop()

Explanation:

  • Purpose: Presents a mandatory disclaimer to users, ensuring they acknowledge the tool’s limitations and liabilities before accessing the dashboard.

  • Mechanism:

    • Checkbox: Users must check the box indicating their agreement to proceed.

    • Session State: Utilizes st.session_state to remember the user’s agreement across interactions.

    • Flow Control: If the user does not agree, the app stops rendering further content.

d. Email Validation and Subscription Functions
def validate_email(email):
    """
    Validates the email format using regex.
    """
    email_regex = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"
    return re.match(email_regex, email) is not None

def subscribe_email(email):
    """
    Subscribes an email to the Mailchimp audience.
    """
    # Mailchimp requires the subscriber hash, which is the MD5 hash of the lowercase version of the email
    email_lower = email.lower().encode()
    subscriber_hash = hashlib.md5(email_lower).hexdigest()

    url = f"{MAILCHIMP_API_URL}/lists/{MAILCHIMP_AUDIENCE_ID}/members/{subscriber_hash}"

    data = {
        "email_address": email,
        "status": "subscribed",  # Explicitly set status to 'subscribed'
        "status_if_new": "subscribed"  # Ensure new members are subscribed
    }

    response = requests.put(
        url,
        auth=("anystring", MAILCHIMP_API_KEY),
        json=data
    )

    return response

def unsubscribe_email(email):
    """
    Unsubscribes an email from the Mailchimp audience.
    """
    # Mailchimp requires the subscriber hash, which is the MD5 hash of the lowercase version of the email
    email_lower = email.lower().encode()
    subscriber_hash = hashlib.md5(email_lower).hexdigest()

    url = f"{MAILCHIMP_API_URL}/lists/{MAILCHIMP_AUDIENCE_ID}/members/{subscriber_hash}"

    data = {
        "status": "unsubscribed"
    }

    response = requests.patch(
        url,
        auth=("anystring", MAILCHIMP_API_KEY),
        json=data
    )

    return response

Explanation:

  • Email Validation:

    • Purpose: Ensures that users enter a correctly formatted email address before attempting subscription or unsubscription.

    • Method: Uses a regular expression to match standard email formats.

  • Subscription Functions:

    • subscribe_email:

      • Functionality: Subscribes a user to the Mailchimp audience list.

      • Process:

        • Hashing: Generates an MD5 hash of the lowercased email address, which Mailchimp uses as the subscriber identifier in the member URL.

        • API Request: Sends a PUT request to Mailchimp to add or update the subscriber’s status to “subscribed”.

    • unsubscribe_email:

      • Functionality: Marks a user as unsubscribed in the Mailchimp audience list; the member record is updated, not deleted.

      • Process:

        • Hashing: Similar to the subscription function.

        • API Request: Sends a PATCH request to update the subscriber’s status to “unsubscribed”.
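
The hashing and URL steps described above can be tried in isolation. The sketch below is a minimal illustration, not the app's code: the base URL and audience ID are placeholder values (assumptions standing in for MAILCHIMP_API_URL and MAILCHIMP_AUDIENCE_ID), the helper names are hypothetical, and no network request is made.

```python
import hashlib
import re

# Placeholder values for illustration only; the real app reads these from its configuration.
EXAMPLE_API_URL = "https://us1.api.mailchimp.com/3.0"
EXAMPLE_AUDIENCE_ID = "abc123"

# Same pattern as validate_email above, precompiled once.
EMAIL_REGEX = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")


def is_valid_email(email: str) -> bool:
    """Returns True when the address matches the simple format check."""
    return EMAIL_REGEX.match(email) is not None


def subscriber_hash(email: str) -> str:
    """MD5 hex digest of the lowercased address, used as the member ID in the URL."""
    return hashlib.md5(email.lower().encode()).hexdigest()


def member_url(email: str) -> str:
    """Builds the member endpoint that both the PUT and PATCH calls target."""
    return f"{EXAMPLE_API_URL}/lists/{EXAMPLE_AUDIENCE_ID}/members/{subscriber_hash(email)}"


if __name__ == "__main__":
    print(is_valid_email("resident@example.com"))  # well-formed address
    print(is_valid_email("not-an-email"))          # no @ or domain part
    # Case differences do not change the hash, so both spellings hit the same member.
    print(member_url("Resident@Example.com") == member_url("resident@example.com"))
```

Because the hash is computed from the lowercased address, subscribing and later unsubscribing with different capitalization still targets the same member record.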

f. Main Application Logic
Code
def main_app():
    """
    The main body of the application: date filters, offense filter, map, etc.
    """

    st.title("Oak Park Crime Map")

    # 1) Load data from the ZIP (cached)
    df = load_data()

    # Convert 'Date' to datetime, remove rows with date=1900 or missing lat/long
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df = df[(df['Date'].notna()) & (df['Date'] != pd.Timestamp("1900-01-01"))]
    df = df.dropna(subset=['Lat', 'Long'])

    # Determine data's min/max date
    min_date_in_data = df['Date'].min().date()
    max_date_in_data = df['Date'].max().date()

    # We'll still allow up to 3 months, but default to just the last 1 month for faster display
    today = datetime.now().date()
    
    # Default = last 1 month
    # e.g. 31 days; you can do 30 if you prefer
    default_start = max(min_date_in_data, today - timedelta(days=31))
    default_end   = min(today, max_date_in_data)

    # Create columns with ratio [1, 2]: left filter (narrow), right map (wider)
    col_filter, col_map = st.columns([1, 2], gap="small")

    with col_filter:
        st.subheader("Date Range (up to 3 months)")

        # Start & End date pickers
        start_date = st.date_input(
            "Start Date",
            value=default_start,
            min_value=min_date_in_data,
            max_value=max_date_in_data
        )
        end_date = st.date_input(
            "End Date",
            value=default_end,
            min_value=min_date_in_data,
            max_value=max_date_in_data
        )

        # Validate date logic
        if end_date < start_date:
            st.warning("End date cannot be before start date. Please adjust.")
            st.stop()

        date_diff = (end_date - start_date).days
        # Still enforce up to 3 months (~92 days)
        if date_diff > 92:
            st.warning("Time range cannot exceed ~3 months (92 days). Please shorten.")
            st.stop()

        # Filter by date
        start_dt = pd.to_datetime(start_date)
        end_dt   = pd.to_datetime(end_date) + pd.Timedelta(days=1)  # inclusive
        date_mask = (df['Date'] >= start_dt) & (df['Date'] < end_dt)
        partial_df = df[date_mask]

        if partial_df.empty:
            st.info("No records found for the selected date range.")
            st.stop()

        # Dynamically gather offenses from partial_df
        unique_offenses = sorted(partial_df['Offense'].dropna().unique())

        st.subheader("Offense Filter")
        with st.expander("Select Offenses (scrollable)", expanded=False):
            if not unique_offenses:
                st.write("No offenses found for this date range.")
                selected_offenses = []
            else:
                # By default, no offenses => show all
                selected_offenses = st.multiselect(
                    "Offense(s)",
                    options=unique_offenses,
                    default=[],  # empty => show all
                    help="Scroll to find more offenses. If empty => show all."
                )

    # If user picks no offense => show all
    if selected_offenses:
        final_df = partial_df[partial_df['Offense'].isin(selected_offenses)]
    else:
        final_df = partial_df

    if final_df.empty:
        st.info("No records found for the selected offense(s).")
        st.stop()

    # Truncate to 2,000
    total_recs = len(final_df)
    if total_recs > 2000:
        st.info(f"There are {total_recs} matching records. Showing only the first 2,000.")
        final_df = final_df.iloc[:2000]
    
    base_url_static = 'https://www.oak-park.us/sites/default/files/police/summaries/'

    with col_map:
        st.write(f"**Displaying {len(final_df)} records on the map**.")

        # Create Folium map
        oak_park_center = [41.885, -87.78]
        crime_map = folium.Map(location=oak_park_center, zoom_start=13)

        for _, row in final_df.iterrows():
            complaint   = safe_field(row.get('Complaint #'))
            offense_val = safe_field(row.get('Offense'))
            date_val    = row.get('Date')
            date_str    = safe_field(date_val.strftime('%Y-%m-%d') if pd.notnull(date_val) else np.nan)
            time_val    = safe_field(row.get('Time'))
            location    = safe_field(row.get('Location'))
            victim      = safe_field(row.get('Victim/Address'))
            narrative   = safe_field(row.get('Narrative'))
            filename = safe_field(row.get('File Name'))
            # Extract the year from the filename
            year = extract_year(filename)
            if year:
                base_url = f"{base_url_static}{year}/"
                link = f"{base_url}{filename}"
            else:
                # Handle cases where the year isn't found or is out of range
                link = "#"
                st.warning(f"Year not found or out of range in filename: {filename}")

            popup_html = f"""
            <b>Complaint #:</b> {complaint}<br/>
            <b>Offense:</b> {offense_val}<br/>
            <b>Date:</b> {date_str}<br/>
            <details>
              <summary><b>View Details</b></summary>
              <b>Time:</b> {time_val}<br/>
              <b>Location:</b> {location}<br/>
              <b>Victim:</b> {victim}<br/>
              <b>Narrative:</b> {narrative}<br/>
              <b>URL:</b> <a href="{link}" target="_blank">PDF Link</a>
            </details>
        """

            folium.Marker(
                location=[row['Lat'], row['Long']],
                popup=folium.Popup(popup_html, max_width=400),
                tooltip=f"Complaint # {complaint}",
                icon=folium.Icon(color="blue", icon="info-sign")
            ).add_to(crime_map)

        st_folium(crime_map, width=1000, height=1000, use_container_width=True)

Explanation:

  • Title: Sets the dashboard’s title to “Oak Park Crime Map”.

  • Data Filtering:

    • Date Conversion: Converts the ‘Date’ column to datetime objects, then drops rows whose date cannot be parsed, rows dated 1900-01-01 (treated as invalid), and rows with missing latitude or longitude.

    • Date Range Determination: Identifies the earliest and latest dates in the dataset to set the boundaries for date selection.

    • Default Dates: Defaults the range to the 31 days before today, clamped to the earliest and latest dates available in the dataset, so the initial view is recent and quick to render.

  • Layout Setup:

    • Columns: Utilizes Streamlit’s st.columns to create a two-column layout:

      • Left Column (col_filter): Contains filters for date range and offense type.

      • Right Column (col_map): Displays the interactive Folium map.

  • Interactive Filters:

    • Date Range Picker: Allows users to select a start and end date within the permissible range.

      • Validation: Ensures that the end date is not before the start date and that the selected range does not exceed three months (92 days).

    • Offense Type Multiselect:

      • Dynamic Options: Populates the multiselect options based on the offenses present in the filtered date range.

      • Default Behavior: If no offenses are selected, all offenses are displayed.

  • Data Truncation:

    • Limitation: Caps the number of records displayed on the map to 2,000 to maintain performance and usability.

  • Map Generation:

    • Folium Map: Centers the map on Oak Park’s geographical coordinates.

    • Markers:

      • Customization: Each crime incident is represented by a blue marker with an info-sign icon.

      • Popups: Clicking on a marker reveals detailed information about the incident, including a link to the original PDF report.

  • Streamlit Folium Integration:

    • st_folium: Embeds the Folium map within the Streamlit app, ensuring seamless interactivity.
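
The default-date and validation rules described above can be sketched without Streamlit. The helper names below are hypothetical and are not part of streamlit_app.py. One deliberate difference from the code above: the default start is measured back from the clamped end date instead of from today, which keeps the default inside the picker's allowed range even if the dataset has not been refreshed for more than a month. Treat that as a suggested variation, not the app's current behavior.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def default_date_range(min_date: date, max_date: date, today: date,
                       window_days: int = 31) -> Tuple[date, date]:
    """Returns (start, end) defaults that fall inside [min_date, max_date]."""
    end = min(today, max_date)
    # Anchor the window on the clamped end date, not on today, so stale data
    # cannot push the default start past max_date.
    start = max(min_date, end - timedelta(days=window_days))
    return start, end


def range_error(start: date, end: date, max_days: int = 92) -> Optional[str]:
    """Mirrors the two validation checks in main_app(); None means the range is acceptable."""
    if end < start:
        return "End date cannot be before start date."
    if (end - start).days > max_days:
        return f"Time range cannot exceed {max_days} days."
    return None


if __name__ == "__main__":
    # Example: data ends in June 2024 but the app is opened in January 2025.
    start, end = default_date_range(date(2024, 1, 1), date(2024, 6, 30), today=date(2025, 1, 9))
    print(start, end, range_error(start, end))
```

Keeping these rules in plain functions also makes them easy to unit test separately from the UI.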
g. Main Function to Control App Flow
Code
def main():
    # Check if user has agreed to disclaimer
    if "user_agreed" not in st.session_state:
        st.session_state["user_agreed"] = False

    if not st.session_state["user_agreed"]:
        show_disclaimer()
    else:
        # Set the page layout to wide
        st.set_page_config(page_title="Oak Park Crime", layout="wide")

        # Add horizontal navigation links at the top
        add_top_links()

        # Proceed with the main application
        main_app()

        # # Add Email Updates section at the bottom
        # add_email_subscription()

if __name__ == "__main__":
    main()

Explanation:

  • Disclaimer Check:

    • Session State: Utilizes st.session_state to track whether the user has agreed to the disclaimer.

    • Flow Control: If the user hasn’t agreed, the disclaimer is displayed; otherwise, the main application proceeds.

  • Page Configuration:

    • Layout: Sets the Streamlit app layout to “wide” for better use of screen real estate.

    • Page Title: Labels the browser tab as “Oak Park Crime”.

  • Navigation Links: Invokes add_top_links() to display navigation links at the top of the dashboard.

  • Main Application Execution: Calls main_app() to render the interactive filters and map.

  • Email Subscription: The call to add_email_subscription() is commented out. Rather than risk exposing the site to an injection-style attack through an in-app form, users are now redirected to a simple Google Form to subscribe or unsubscribe.

Detailed Code Breakdown

Let’s delve deeper into specific functions and sections of streamlit_app.py to understand their roles and implementations.

a. Email Subscription Management

Although the call to add_email_subscription() is commented out, its intended functionality is documented here because I may reuse it in a future one-off project.

Code
def add_email_subscription():
    """
    Displays subscription and unsubscription forms at the bottom of the page.
    The forms are within a collapsed expander that the user can expand manually.
    """
    # Add an anchor to scroll to
    st.markdown('<a id="email-updates"></a>', unsafe_allow_html=True)
    
    # **3. Implement Email Updates within a Collapsed Expander**
    with st.expander("📧 Email Updates", expanded=False):
        st.markdown("### Subscribe to Email Updates")
        with st.form("email_subscription_form"):
            subscribe_email_input = st.text_input("Enter your email address to subscribe:")
            subscribe_submit = st.form_submit_button("Subscribe")

            if subscribe_submit:
                if validate_email(subscribe_email_input):
                    response = subscribe_email(subscribe_email_input)
                    if response.status_code == 200:
                        # Check if the email was already subscribed
                        response_data = response.json()
                        status = response_data.get("status")
                        if status == "subscribed":
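                            # Intended to distinguish a resubscribe from a first-time signup.
                            # NOTE: despite the variable name, this reads 'status_if_new', a field sent in
                            # the request; if the response does not echo it, this is None and the
                            # "added" message below is always shown.
                            previous_status = response_data.get("status_if_new")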
                            if previous_status == "subscribed":
                                st.success("Subscription successful! You've been resubscribed to the email list.")
                            else:
                                st.success("Subscription successful! You've been added to the email list.")
                        else:
                            st.info("You are already subscribed.")
                    else:
                        # Handle errors
                        error_message = response.json().get('detail', 'An error occurred.')
                        st.error(f"Subscription failed: {error_message}")
                else:
                    st.error("Please enter a valid email address.")

        st.markdown("---")  # Separator

        st.markdown("### Unsubscribe from Email Updates")
        with st.form("email_unsubscription_form"):
            unsubscribe_email_input = st.text_input("Enter your email address to unsubscribe:")
            unsubscribe_submit = st.form_submit_button("Unsubscribe")

            if unsubscribe_submit:
                if validate_email(unsubscribe_email_input):
                    response = unsubscribe_email(unsubscribe_email_input)
                    if response.status_code == 200:
                        response_data = response.json()
                        status = response_data.get("status")
                        if status == "unsubscribed":
                            st.success("You have been unsubscribed successfully.")
                        else:
                            st.info("Your email was not found in our list.")
                    else:
                        # Handle errors
                        error_message = response.json().get('detail', 'An error occurred.')
                        st.error(f"Unsubscription failed: {error_message}")
                else:
                    st.error("Please enter a valid email address.")

Explanation:

  • Purpose: Provides users with forms to subscribe or unsubscribe from email updates.

  • Structure:

    • Expander: Collapses the email update section to keep the dashboard clean.

    • Subscription Form:

      • Input: Email address field.

      • Submission: Validates and processes the subscription request via Mailchimp.

      • Feedback: Displays success or error messages based on the API response.

    • Unsubscription Form:

      • Input: Email address field.

      • Submission: Validates and processes the unsubscription request via Mailchimp.

      • Feedback: Displays success or error messages based on the API response.

  • Current Status: The call to add_email_subscription() in main() is commented out, so these forms are not currently rendered.
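
The feedback branches above depend on knowing the member's status before the request was sent. One way to keep that decision separate from the form code is sketched below. It assumes the caller has looked up the previous status before subscribing (for example with a separate request against the same member URL); that lookup and the function name are assumptions for illustration and do not exist in the current code.

```python
from typing import Optional


def subscribe_feedback(previous_status: Optional[str], new_status: str) -> str:
    """Chooses the user-facing message after a subscribe request.

    previous_status: the member's status before the request, or None if no
                     member record existed (hypothetical input, see note above).
    new_status:      the "status" field returned by the subscribe request.
    """
    if new_status != "subscribed":
        return f"Subscription was not completed (current status: {new_status})."
    if previous_status is None:
        return "Subscription successful! You've been added to the email list."
    if previous_status == "subscribed":
        return "You are already subscribed."
    # Any other earlier status (e.g. "unsubscribed") counts as a resubscribe.
    return "Subscription successful! You've been resubscribed to the email list."


if __name__ == "__main__":
    print(subscribe_feedback(None, "subscribed"))
    print(subscribe_feedback("unsubscribed", "subscribed"))
    print(subscribe_feedback("subscribed", "subscribed"))
```

A pure function like this can be tested without calling Mailchimp or rendering a Streamlit form.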

b. Year Extraction from Filename

Code
def extract_year(filename, start_year=2017, end_year=2030):
    """
    Extracts a four-digit year from the filename.
    Returns the year as a string if found and within the range.
    Returns None otherwise.
    """
    match = re.search(r'(20[1][7-9]|20[2][0-9]|2030)', filename)
    if match:
        return match.group(0)
    return None

Explanation:

  • Purpose: Retrieves the year from the PDF filename to construct accurate URLs linking back to the original reports.

  • Logic:

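    • Regex Pattern: Matches years from 2017 through 2030. The range is hard-coded in the pattern, so the start_year and end_year parameters are not actually consulted. If the tool is still in use after 2030, or if earlier crime records turn up on the Oak Park website or elsewhere, the pattern needs to be updated.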

    • Return Value: Provides the matched year as a string or None if no valid year is found.

Usage in main_app():

  • URL Construction: Utilizes the extracted year to build the base URL for the PDF link.

  • Fallback Handling: If the year isn’t found, the PDF link defaults to #, and a warning is displayed.
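
As a possible follow-up to the note about the hard-coded range, the sketch below shows a variant that honors start_year and end_year, together with the link construction described above. The function names and the sample filenames are hypothetical; only the base URL is taken from the code shown earlier.

```python
import re
from typing import Optional

BASE_URL_STATIC = "https://www.oak-park.us/sites/default/files/police/summaries/"


def extract_year_in_range(filename: str, start_year: int = 2017,
                          end_year: int = 2030) -> Optional[str]:
    """Returns the first 20xx year within [start_year, end_year], else None."""
    # The lookahead lets candidate matches overlap, so a valid year is not
    # skipped when it shares digits with an earlier out-of-range candidate.
    for match in re.finditer(r"(?=(20\d{2}))", str(filename)):
        year = int(match.group(1))
        if start_year <= year <= end_year:
            return match.group(1)
    return None


def build_pdf_link(filename: str) -> str:
    """Returns the full PDF URL, or '#' when no usable year is found."""
    year = extract_year_in_range(filename)
    if year is None:
        return "#"
    return f"{BASE_URL_STATIC}{year}/{filename}"


if __name__ == "__main__":
    # Made-up filenames, used only to exercise both branches.
    print(build_pdf_link("summary-2024-12-30.pdf"))
    print(build_pdf_link("summary-undated.pdf"))
```

With this shape, extending coverage past 2030 becomes a matter of changing a default argument instead of editing the pattern.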

c. Interactive Map Rendering

Code
with col_map:
    st.write(f"**Displaying {len(final_df)} records on the map**.")

    # Create Folium map
    oak_park_center = [41.885, -87.78]
    crime_map = folium.Map(location=oak_park_center, zoom_start=13)

    for _, row in final_df.iterrows():
        complaint   = safe_field(row.get('Complaint #'))
        offense_val = safe_field(row.get('Offense'))
        date_val    = row.get('Date')
        date_str    = safe_field(date_val.strftime('%Y-%m-%d') if pd.notnull(date_val) else np.nan)
        time_val    = safe_field(row.get('Time'))
        location    = safe_field(row.get('Location'))
        victim      = safe_field(row.get('Victim/Address'))
        narrative   = safe_field(row.get('Narrative'))
        filename = safe_field(row.get('File Name'))
        # Extract the year from the filename
        year = extract_year(filename)
        if year:
            base_url = f"{base_url_static}{year}/"
            link = f"{base_url}{filename}"
        else:
            # Handle cases where the year isn't found or is out of range
            link = "#"
            st.warning(f"Year not found or out of range in filename: {filename}")

        popup_html = f"""
        <b>Complaint #:</b> {complaint}<br/>
        <b>Offense:</b> {offense_val}<br/>
        <b>Date:</b> {date_str}<br/>
        <details>
          <summary><b>View Details</b></summary>
          <b>Time:</b> {time_val}<br/>
          <b>Location:</b> {location}<br/>
          <b>Victim:</b> {victim}<br/>
          <b>Narrative:</b> {narrative}<br/>
          <b>URL:</b> <a href="{link}" target="_blank">PDF Link</a>
        </details>
    """

        folium.Marker(
            location=[row['Lat'], row['Long']],
            popup=folium.Popup(popup_html, max_width=400),
            tooltip=f"Complaint # {complaint}",
            icon=folium.Icon(color="blue", icon="info-sign")
        ).add_to(crime_map)

    st_folium(crime_map, width=1000, height=1000, use_container_width=True)

Explanation:

  • Map Initialization:

    • Centering: The map is centered on Oak Park’s geographical coordinates with a zoom level of 13 for optimal visibility.

  • Marker Creation:

    • Looping Through Data: Iterates over each record in the filtered DataFrame (final_df).

    • Data Extraction: Retrieves necessary fields like Complaint Number, Offense, Date, Time, Location, Victim Address, Narrative, and File Name.

    • URL Formation: Constructs a direct link to the original PDF report using the extracted year. If the year isn’t found, the link defaults to #, and a warning is displayed.

    • Popup HTML: Formats the incident details into an HTML structure that includes expandable sections (<details>) for additional information and the PDF link.

  • Folium Marker Customization:

    • Location: Plots the marker based on latitude and longitude.

    • Popup: Attaches the formatted HTML popup to the marker.

    • Tooltip: Displays the Complaint Number when hovering over the marker.

    • Icon: Uses a blue info-sign icon for consistency and visibility.

  • Map Embedding:

    • st_folium: Renders the Folium map within the Streamlit dashboard, allowing for interactivity like zooming and panning.
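
Because the popup is assembled from free-text fields such as the narrative, it may be worth escaping those values before they are placed in the HTML. The sketch below shows one way to do that with the standard library. It assumes safe_field does not already escape HTML (its implementation is not shown in this section), and the helper name and the "N/A" placeholder are choices made for this illustration only.

```python
from html import escape


def build_popup_html(record: dict, link: str) -> str:
    """Builds marker popup markup with every data field HTML-escaped."""

    def row(label: str, key: str) -> str:
        # "N/A" is a placeholder chosen for this sketch.
        value = record.get(key)
        text = "N/A" if value is None else escape(str(value))
        return f"<b>{label}:</b> {text}<br/>"

    summary = row("Complaint #", "Complaint #") + row("Offense", "Offense") + row("Date", "Date")
    details = (row("Time", "Time") + row("Location", "Location")
               + row("Victim", "Victim/Address") + row("Narrative", "Narrative"))
    # quote=True also escapes quotation marks, which matters inside the href attribute.
    url = f'<b>URL:</b> <a href="{escape(link, quote=True)}" target="_blank">PDF Link</a>'
    return f"{summary}<details><summary><b>View Details</b></summary>{details}{url}</details>"


if __name__ == "__main__":
    sample = {"Complaint #": "24-001", "Offense": "Theft", "Narrative": "Item <b>taken</b> & not recovered"}
    print(build_popup_html(sample, "#"))
```

The returned string can be passed to folium.Popup in the same way as popup_html in the loop above.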

Conclusion of Live Streamlit Dashboard Section

The Live Streamlit Dashboard is a pivotal component of the Oak Park Crime Reporting project, offering users an intuitive and interactive means to explore crime data. Dynamic filters and an interactive map make the data easy to explore, while the Mailchimp-based subscription forms remain in the codebase but are currently disabled in favor of a Google Form. The modular design, built on Streamlit and Folium, keeps the dashboard maintainable and leaves room for third-party integrations such as Mailchimp to be re-enabled later.


Static HTML Generation

Overview

The Static HTML Generation component automates the creation of static HTML reports and maps based on the latest crime data. This ensures that the information remains accessible even outside the interactive Streamlit dashboard environment. The component performs the following tasks:

  1. Data Aggregation: Compiles crime data for the past week.

  2. Map Creation: Generates interactive and cumulative Folium maps with detailed popups.

  3. GitHub Integration: Uploads the generated HTML reports and CSV files to a GitHub repository, facilitating easy sharing and hosting via GitHub Pages.

  4. Email Dissemination: Sends automated emails containing the latest reports and links to subscribers.

The primary script responsible for this component is weekly_crime_report.py.

Key Components

1. weekly_crime_report.py

Purpose:
This script automates the generation of weekly crime reports, including interactive maps and CSV data files. It handles data filtering, map creation with disclaimers, uploading to GitHub, and sending out emails to subscribers with the latest reports.

Imports and Dependencies:

Code
import os
import pandas as pd
import folium
import logging
import zipfile
import time
from pathlib import Path
from datetime import datetime, timedelta
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication
import base64
import sys

import numpy as np
import re
import hashlib
import requests

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Local utility to load env vars
from utils import load_env_vars, extract_year, upload_file_to_github, upload_files_to_github_batch

# Folium plugins
from folium.plugins import MarkerCluster

# Define Gmail API scope
SCOPES = ['https://www.googleapis.com/auth/gmail.send']

  • Standard Libraries: os, re, logging, zipfile, time, pathlib, datetime, email, base64, sys, hashlib

  • Third-Party Libraries: pandas, numpy, folium, requests, google-auth, google-auth-oauthlib, google-auth-httplib2, google-api-python-client

  • Local Utilities: Functions imported from utils.py

Main Functionalities:

  1. Environment Setup:

    • Loads environment variables from env_vars.txt.

    • Configures logging.

    • Sets up directories for data, maps, CSVs, and GitHub integration.

  2. Data Processing:

    • Loads all crime data from a compressed ZIP file.

    • Filters data for the past week.

    • Writes the filtered data to a CSV file.

  3. Map Creation:

    • Generates an interactive Folium map with markers for each crime incident.

    • Generates a cumulative Folium map displaying all crime incidents.

    • Adds disclaimers as overlaying HTML elements on the maps.

  4. GitHub Integration:

    • Uploads the generated HTML maps and CSV files to specified GitHub repository subfolders.

  5. Email Dissemination:

    • Authenticates with the Gmail API.

    • Prepares and sends an email to subscribers with links to the latest reports and attached CSV files.

  6. Logging and Reporting:

    • Logs processing steps, errors, and summary statistics.

    • Measures and records the execution time of the script.
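
The "past week" filter in step 2 comes down to a date cutoff. The script itself works on a pandas DataFrame; the sketch below uses plain dictionaries only so that it can be run without dependencies. The function name and the exact boundary handling (both ends inclusive) are assumptions, since the script's filtering code is not shown in this section.

```python
from datetime import date, timedelta
from typing import Iterable, List


def filter_past_days(records: Iterable[dict], today: date, days: int = 7) -> List[dict]:
    """Keeps records dated on or after (today - days) and on or before today."""
    cutoff = today - timedelta(days=days)
    return [
        r for r in records
        if r.get("Date") is not None and cutoff <= r["Date"] <= today
    ]


if __name__ == "__main__":
    sample = [
        {"Complaint #": "A", "Date": date(2025, 1, 1)},  # older than the cutoff
        {"Complaint #": "B", "Date": date(2025, 1, 5)},  # inside the window
        {"Complaint #": "C", "Date": None},              # missing date
    ]
    print([r["Complaint #"] for r in filter_past_days(sample, today=date(2025, 1, 9))])
```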

Sample Code Snippet:

Code
def create_folium_map_filtered_data(
    df,
    lat_col='Lat',
    lng_col='Long',
    offense_col='Offense',
    date_col='Date',
    output_html_path='weekly_map.html'
):
    """
    Creates a Folium map plotting each record in df with disclaimers overlay (JS + HTML) 
    and saves it to 'output_html_path'.
    """
    oak_park_center = [41.885, -87.78]
    crime_map = folium.Map(location=oak_park_center, zoom_start=13)

    marker_cluster = MarkerCluster().add_to(crime_map)
    # Define the static part of the base URL
    base_url_static = 'https://www.oak-park.us/sites/default/files/police/summaries/'

    for _, row in df.iterrows():
        lat = row[lat_col]
        lng = row[lng_col]
        offense = row.get(offense_col, "Unknown")
        complaint = safe_field(row.get('Complaint #'))
        offense_val = safe_field(offense)
        date_str = safe_field(row[date_col].strftime('%Y-%m-%d') if pd.notnull(row[date_col]) else np.nan)
        time_val = safe_field(row.get('Time'))
        location = safe_field(row.get('Location'))
        victim = safe_field(row.get('Victim/Address'))
        narrative = safe_field(row.get('Narrative'))
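        filename = safe_field(row.get('File Name'))
        # Build the full PDF URL from the year embedded in the filename;
        # fall back to "#" when no usable year is found.
        year = extract_year(filename)
        link = f"{base_url_static}{year}/{filename}" if year else "#"

        popup_html = f"""
            <b>Complaint #:</b> {complaint}<br/>
            <b>Offense:</b> {offense_val}<br/>
            <b>Date:</b> {date_str}<br/>
            <details>
              <summary><b>View Details</b></summary>
              <b>Time:</b> {time_val}<br/>
              <b>Location:</b> {location}<br/>
              <b>Victim:</b> {victim}<br/>
              <b>Narrative:</b> {narrative}<br/>
              <b>URL:</b> <a href="{link}" target="_blank">PDF Link</a>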
            </details>
        """

        folium.Marker(
            location=[lat, lng],
            popup=folium.Popup(popup_html, max_width=300),
            icon=folium.Icon(color='red', icon='info-sign')
        ).add_to(marker_cluster)

    # Basic map title
    title_html = '''
    <h3 align="center" style="font-size:20px"><b>Oak Park Crime Map</b></h3>
    <br>
    <h3 align="center" style="font-size:10px">
    <a href="https://jesse-anderson.net/">My Portfolio</a> |
    <a href="https://blog.jesse-anderson.net/">My Blog</a> |
    <a href="https://blog.jesse-anderson.net/">Documentation</a> |
    <a href="mailto:jesse@jesse-anderson.net">Contact</a> |
    <a href="https://forms.gle/GnyaVwo1Vzm8nBH6A">
        Add me to Weekly Updates
    </a>
    </h3>
'''
    crime_map.get_root().html.add_child(folium.Element(title_html))

    # 1) Overlays disclaimers in a "splash screen" with JavaScript:
    disclaimers_overlay = """
    <style>
    /* Full-page overlay styling */
    #disclaimerOverlay {
      position: fixed;
      z-index: 9999; /* On top of everything */
      left: 0;
      top: 0;
      width: 100%;
      height: 100%;
      background-color: rgba(255, 255, 255, 0.95);
      color: #333;
      display: block; /* Visible by default */
      overflow: auto;
      text-align: center;
      padding-top: 100px;
      font-family: Arial, sans-serif;
    }
    #disclaimerContent {
      background: #f9f9f9;
      border: 1px solid #ccc;
      display: inline-block;
      padding: 20px;
      max-width: 800px;
      text-align: left;
    }
    #acceptButton {
      margin-top: 20px;
      padding: 10px 20px;
      font-size: 16px;
      cursor: pointer;
    }
    </style>

    <div id="disclaimerOverlay">
      <div id="disclaimerContent">
        <h2>Important Legal Disclaimer</h2>
        <p><strong>By using this demonstrative research tool, you acknowledge and agree:</strong></p>
        <ul>
            <li>This tool is for <strong>demonstration purposes only</strong>.</li>
            <li>The data originated from publicly available Oak Park Police Department PDF files.
                View the official site here: 
                <a href="https://www.oak-park.us/village-services/police-department"
                   target="_blank">Oak Park Police Department</a>.</li>
            <li>During parsing, <strong>~10%</strong> of complaints were <strong>omitted</strong> 
                due to parsing issues; thus the data is <strong>incomplete</strong>.</li>
            <li>The <strong>official</strong> and <strong>complete</strong> PDF files remain 
                with the Oak Park Police Department.</li>
            <li>You <strong>will not hold</strong> the author <strong>liable</strong> for <strong>any</strong> 
                decisions—formal or informal—based on this tool.</li>
            <li>This tool <strong>should not</strong> be used in <strong>any</strong> official or unofficial 
                <strong>decision-making</strong>.</li>
        </ul>
        <p><strong>By continuing, you indicate your acceptance of these terms 
           and disclaim all liability.</strong></p>
        <hr/>
        <button id="acceptButton" onclick="hideOverlay()">I Accept</button>
      </div>
    </div>

    <script>
    function hideOverlay() {
      var overlay = document.getElementById('disclaimerOverlay');
      overlay.style.display = 'none'; 
    }
    </script>
    """

    disclaimers_element = folium.Element(disclaimers_overlay)
    crime_map.get_root().html.add_child(disclaimers_element)

    # 2) Save final HTML
    crime_map.save(str(output_html_path))

Explanation:

  • Map Initialization:

    • Centering: Centers the Folium map on Oak Park’s geographical coordinates with a zoom level of 13 for optimal visibility.

  • Marker Creation:

    • Customization: Each crime incident is represented by a red marker with an info-sign icon.

    • Popups: Clicking on a marker reveals detailed information about the incident, including a link to the original PDF report.

  • Title Addition:

    • HTML Styling: Adds a title and navigation links directly onto the Folium map using HTML and inline CSS.

  • Disclaimer Overlay:

    • Purpose: Overlays a full-page disclaimer that users must accept before interacting with the map.

    • Implementation:

      • CSS: Styles the overlay to cover the entire page with a semi-transparent background.

      • HTML: Structures the disclaimer content within a styled div.

      • JavaScript: Provides a function to hide the overlay when the user clicks the “I Accept” button.

  • Map Saving:

    • Output: Saves the generated map with overlays to the specified HTML file path.

d. Gmail API Service Setup
Code
def get_gmail_service():
    """
    Authenticates the user and returns the Gmail API service.
    """
    creds = None
    token_path = Path('token.json')

    if token_path.exists():
        creds = Credentials.from_authorized_user_file(str(token_path), SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            try:
                creds.refresh(Request())
                logging.info("Credentials refreshed successfully.")
            except Exception as e:
                logging.error(f"Error refreshing credentials: {e}")
                print(f"[ERROR] Could not refresh credentials: {e}")
                return None
        else:
            try:
                flow = InstalledAppFlow.from_client_secrets_file(
                    'credentials.json', SCOPES
                )
                creds = flow.run_local_server(port=0)
                logging.info("Authentication flow completed successfully.")
            except Exception as e:
                logging.error(f"Error during OAuth flow: {e}")
                print(f"[ERROR] Could not complete OAuth flow: {e}")
                return None

        try:
            with open(token_path, 'w') as token:
                token.write(creds.to_json())
                logging.info("Credentials saved to token.json.")
        except Exception as e:
            logging.error(f"Failed to save credentials: {e}")
            print(f"[ERROR] Could not save credentials: {e}")

    try:
        service = build('gmail', 'v1', credentials=creds)
        logging.info("Gmail service created successfully.")
        return service
    except HttpError as error:
        logging.error(f"An error occurred while building Gmail service: {error}")
        print(f"[ERROR] An error occurred while building Gmail service: {error}")
        return None

Explanation:

  • Purpose: Authenticates the user and establishes a connection with the Gmail API to send emails.

  • Authentication Flow:

    1. Token Check: Looks for existing credentials in token.json.

    2. Token Refresh: If credentials are expired but have a refresh token, it attempts to refresh them.

    3. OAuth Flow: If no valid credentials exist, initiates the OAuth flow using credentials.json.

    4. Credential Saving: Saves the new or refreshed credentials back to token.json for future use.

  • Error Handling: Logs and prints errors encountered during authentication or service creation.

Usage: This function ensures secure and authenticated access to the Gmail API, enabling the script to send automated emails containing the latest crime reports.
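
The branching in the authentication flow above can be sketched without any Google libraries. This is a minimal illustration, not the script's code: FakeCreds and next_auth_step are hypothetical names, and FakeCreds only mimics the three attributes (valid, expired, refresh_token) that the real credentials object exposes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeCreds:
    """Hypothetical stand-in for the real credentials object."""
    valid: bool
    expired: bool
    refresh_token: Optional[str]


def next_auth_step(creds: Optional[FakeCreds]) -> str:
    """Return which branch of the authentication flow would run."""
    if creds and creds.valid:
        return "use_existing"   # token.json is still good
    if creds and creds.expired and creds.refresh_token:
        return "refresh"        # corresponds to creds.refresh(Request())
    return "oauth_flow"         # corresponds to the InstalledAppFlow branch


print(next_auth_step(None))                          # oauth_flow
print(next_auth_step(FakeCreds(True, False, "r")))   # use_existing
print(next_auth_step(FakeCreds(False, True, "r")))   # refresh
print(next_auth_step(FakeCreds(False, True, None)))  # oauth_flow
```

The last case shows why the refresh token matters: expired credentials without one fall back to the full OAuth flow.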

e. Email Sending Function
Code
def send_email_with_disclaimer_and_links(
    service,
    sender_email,
    to_emails,
    subject,
    body_text,
    attachments
):
    """
    Sends an email with disclaimers and links using the Gmail API.
    """
    try:
        message = MIMEMultipart()
        message['to'] = "Undisclosed Recipients <jesse@jesse-anderson.net>"
        message['subject'] = subject
        message['from'] = sender_email

        # disclaimer = """
        # <p><strong>Important Legal Disclaimer</strong></p>
        # <p><strong>By using this demonstrative research tool, you acknowledge and agree:</strong></p>
        # <ul>
        #     <li>This tool is for <strong>demonstration purposes only</strong>.</li>
        #     <li>The data originated from publicly available Oak Park Police Department PDF files.
        #         <a href="https://www.oak-park.us/village-services/police-department">Official site</a>.</li>
        #     <li>During parsing, <strong>~10%</strong> of complaints were <strong>omitted</strong>.</li>
        #     <li>The <strong>official</strong> and <strong>complete</strong> PDF files remain with the Oak Park Police Department.</li>
        #     <li>You <strong>will not hold</strong> the author <strong>liable</strong> for any decisions
        #         based on this tool.</li>
        #     <li>This tool <strong>should not</strong> be used in any official or unofficial <strong>decision-making</strong>.</li>
        # </ul>
        # <p><strong>By continuing, you disclaim all liability.</strong></p>
        # <hr>
        # """

        links = """
        <p>
            <a href="https://jesse-anderson.net/">My Portfolio</a> | 
            <a href="https://blog.jesse-anderson.net/">My Blog</a>
        </p>
        <hr>
        """

        # Define the plain text content
        plain_text = f"""
        Important Legal Disclaimer
        
        By using this demonstrative research tool, you acknowledge and agree:

        - This tool is for demonstration purposes only.
        - The data originated from publicly available Oak Park Police Department PDF files.
          View the official site here: https://www.oak-park.us/village-services/police-department.
        - During parsing, ~10% of total complaints since 2018 were omitted due to parsing issues; 
          thus the data is incomplete.
        - The official and complete PDF files remain with the Oak Park Police Department.
        - You will not hold the author liable for any decisions—formal or informal—based on this tool.
        - This tool should not be used in any official or unofficial decision-making.

        By continuing, you indicate your acceptance of these terms and disclaim all liability.

        ------------

        Hello,
        The crime report from {body_text['start_date']} to {body_text['end_date']} is attached as a .csv file.

        Interactive map:
        {body_text['weekly_map_url']}

        Cumulative map:
        {body_text['cumulative_map_url']}
        
        Last week's data:
        {body_text['csv_url']}
        """

        part1 = MIMEText(plain_text, 'plain')
        message.attach(part1)

        bcc_emails = ", ".join(to_emails)
        message['bcc'] = bcc_emails

        # Attach the CSV
        for file_path in attachments:
            file_path = Path(file_path)
            if not file_path.exists():
                logging.warning(f"Attachment '{file_path}' not found, skipping.")
                continue
            with open(file_path, 'rb') as f:
                mime_application = MIMEApplication(f.read(), Name=file_path.name)
            mime_application['Content-Disposition'] = f'attachment; filename="{file_path.name}"'
            message.attach(mime_application)

        raw_message = base64.urlsafe_b64encode(message.as_bytes()).decode()
        body = {'raw': raw_message}

        sent_message = service.users().messages().send(userId="me", body=body).execute()
        logging.info(f"Email sent successfully. Message ID: {sent_message['id']}")
        print("Crime report email sent successfully.")
    except Exception as e:
        logging.error(f"Failed to send email: {e}")
        print(f"[ERROR] Failed to send email: {e}")

Explanation:

  • Purpose: Composes and sends an email containing the latest crime reports and links to interactive maps.

  • Structure:

    • Email Composition:

      • Plain Text: Provides a disclaimer and details about the latest crime reports.

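      • Attachments: Attaches every file path passed in attachments, skipping any that do not exist. The main workflow currently passes an empty list, so the CSV reaches readers through its link rather than as an attachment.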

    • BCC: Sends the email to undisclosed recipients to maintain privacy.

  • Encoding: Encodes the full MIME message as URL-safe base64 and places it in the raw field of the request body, as the Gmail API expects.

  • Sending: Utilizes the authenticated Gmail service to send the email.

  • Error Handling: Logs and prints errors encountered during the email sending process.
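
The composition and encoding steps can be tried in isolation with the standard library alone. The sketch below is an assumption-laden illustration: build_raw_message is a hypothetical helper, the example.com addresses are placeholders, and nothing is sent. It builds the message, encodes it the way the raw field expects, and decodes it again to confirm the headers survive the round trip.

```python
import base64
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_raw_message(sender, bcc_list, subject, text):
    """Build a multipart message and return it as URL-safe base64 text."""
    message = MIMEMultipart()
    message["to"] = "Undisclosed Recipients <sender@example.com>"  # placeholder
    message["from"] = sender
    message["subject"] = subject
    message["bcc"] = ", ".join(bcc_list)  # real recipients stay hidden from each other
    message.attach(MIMEText(text, "plain"))
    # Encode the bytes, then decode to str so the value fits in a JSON body
    return base64.urlsafe_b64encode(message.as_bytes()).decode()


raw = build_raw_message(
    "sender@example.com",
    ["a@example.com", "b@example.com"],
    "Crime Report",
    "Hello, the weekly report is ready.",
)
body = {"raw": raw}  # shape of the request body passed to messages().send()

# Round-trip check: decode the payload and inspect the headers
decoded = message_from_bytes(base64.urlsafe_b64decode(raw))
print(decoded["subject"])
print(decoded["bcc"])
```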

f. Main Report Generation Workflow
Code
def main_report_generation():
    """
    Executes the full pipeline:
    1. Load environment variables
    2. Load & filter data for the last 7 days
    3. Create & save Folium map with disclaimers overlay
    4. Upload HTML to GitHub
    5. Send email with CSV attached
    """
    start_time = time.time()
    script_dir = Path(__file__).parent.resolve()

    env_file_path = script_dir / "env_vars.txt"
    try:
        load_env_vars(env_file_path)
    except FileNotFoundError as e:
        print(f"[ERROR] {e}")
        return

    sender_email = os.getenv("SENDER_EMAIL")
    if not sender_email:
        raise ValueError("Missing SENDER_EMAIL in env_vars.txt")

    logging.basicConfig(
        filename=script_dir / 'full_crime_report.log',
        level=logging.DEBUG,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

    data_dir = script_dir / 'data'
    map_dir = script_dir / 'generated_maps'
    # recipients_csv = script_dir / 'recipients.csv'
    csv_dir = script_dir / 'generated_csvs'  # New directory for CSVs
    map_dir.mkdir(parents=True, exist_ok=True)
    csv_dir.mkdir(parents=True, exist_ok=True)

    zip_file_path = data_dir / 'summary_report.zip'

    execution_date = datetime.now().date()
    date_str = execution_date.strftime('%Y-%m-%d')

    # CSV & HTML output
    filtered_subset_filename = f'filtered_subset_{date_str}.csv'
    filtered_subset_path = csv_dir / filtered_subset_filename
    weekly_map_output_filename = f'crime_map_weekly_{date_str}.html'
    weekly_map_output_path = map_dir / weekly_map_output_filename
    cumulative_map_output_filename = 'crime_map_cumulative.html'
    cumulative_map_output_path = map_dir / cumulative_map_output_filename

    # Local GitHub Pages folder
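    github_repo_path = os.getenv("GITHUB_REPO")
    if not github_repo_path:
        # Path(None) would raise a TypeError, so fail early with a clear message
        raise ValueError("Missing GITHUB_REPO in env_vars.txt")
    github_repo_path = Path(github_repo_path)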

    # (C) Load data
    try:
        df_full = load_all_crimes(zip_file_path)
    except Exception as e:
        logging.error(f"Failed to load all crimes: {e}")
        print(f"[ERROR] Could not load all crimes: {e}")
        return

    if df_full.empty:
        logging.info("No crime data—no map or CSV generated.")
        print("No crime data—no map or CSV generated.")
        return

    # (D) Filter last 7 days
    try:
        start_date, end_date = determine_date_range(df_full, execution_date)
        df_filtered = filter_crime_data(df_full, start_date, end_date)
    except Exception as e:
        logging.error(f"Error determining date range: {e}")
        print(f"[ERROR] Could not determine date range: {e}")
        return

    if df_filtered.empty:
        logging.info("No crimes found—no map or CSV generated.")
        print("No crimes found in the determined date range—no map or CSV generated.")
        return

    # (E) Write filtered CSV
    try:
        df_filtered.to_csv(filtered_subset_path, index=False, encoding="cp1252")
        logging.info(f"Filtered data written to {filtered_subset_path}")
    except Exception as e:
        logging.error(f"Failed to write filtered data to CSV: {e}")
        print(f"[ERROR] Could not write filtered data to CSV: {e}")
        return

    # (F) Create Folium map with disclaimers overlay
    try:
        create_folium_map_filtered_data(
            df=df_filtered,
            lat_col='Lat',
            lng_col='Long',
            offense_col='Offense',
            date_col='Date',
            output_html_path=weekly_map_output_path
        )
        logging.info(f"Folium map created at {weekly_map_output_path}")
    except Exception as e:
        logging.error(f"Error creating Folium map: {e}")
        print(f"[ERROR] Could not create Folium map: {e}")
        return

    try:
        create_folium_map_cumulative(
            df=df_full,  # Use the full dataset for cumulative map
            lat_col='Lat',
            lng_col='Long',
            offense_col='Offense',
            date_col='Date',
            output_html_path=cumulative_map_output_path
        )
        logging.info(f"Cumulative Folium map created at {cumulative_map_output_path}")
    except Exception as e:
        logging.error(f"Error creating cumulative Folium map: {e}")
        print(f"[ERROR] Could not create cumulative Folium map: {e}")
        return
    
    test = False
    if not test:
        # (G) Upload Map and csv
        try:
            # Upload Maps
            files_to_upload_maps = [weekly_map_output_path, cumulative_map_output_path]
            upload_files_to_github_batch(
                file_paths=files_to_upload_maps,
                github_repo_path=github_repo_path,
                target_subfolder='OP-Crime-Maps'
            )
            time.sleep(5) #paranoia
            # Upload CSV
            files_to_upload_csv = filtered_subset_path
            upload_file_to_github(
                file_path=files_to_upload_csv,
                github_repo_path=github_repo_path,
                target_subfolder='OP-Crime-Data'
            )
        except Exception as e:
            logging.error(f"Failed to upload files to GitHub: {e}")
            print(f"[ERROR] Could not upload files to GitHub: {e}")
            return
        # (I) Generate GitHub URLs for the uploaded files
        # Assuming GitHub Pages are served from the root of the repository
        github_base_url = "https://jesse-anderson.github.io"

        weekly_map_url = f"{github_base_url}/OP-Crime-Maps/{weekly_map_output_filename}"
        cumulative_map_url = f"{github_base_url}/OP-Crime-Maps/{cumulative_map_output_filename}"
        csv_url = f"{github_base_url}/OP-Crime-Data/{filtered_subset_filename}"

        # (J) Gmail API & Email
        try:
            service = get_gmail_service()
            if not service:
                raise Exception("Failed to create Gmail service.")
        except Exception as e:
            logging.error(f"Authentication failed: {e}")
            print(f"[ERROR] Authentication failed: {e}")
            return
        time.sleep(60) #time to build github pages....
        # (K) Load Recipients
        try:
            # to_list = load_recipients_list(recipients_csv)
            # to_list = get_mailchimp_subscribers()
            to_list = ["myemail@gmail.com"]
        except FileNotFoundError as e:
            logging.error(f"Error loading recipients: {e}")
            print(f"[ERROR] {e}")
            return

        if not to_list:
            logging.warning("No recipients found—cannot send email.")
            print("No recipients found in recipients.csv—cannot send email.")
            return

        subject = f"Crime Report from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
        body_text = {
            'start_date': start_date.strftime('%Y-%m-%d'),
            'end_date': end_date.strftime('%Y-%m-%d'),
            'weekly_map_url': weekly_map_url,
            'cumulative_map_url': cumulative_map_url,
            'csv_url': csv_url
        }
        attachments = []

        try:
            send_email_with_disclaimer_and_links(
                service=service,
                sender_email=sender_email,
                to_emails=to_list,
                subject=subject,
                body_text=body_text,
                attachments=attachments
            )
        except Exception as e:
            logging.error(f"Failed to send email: {e}")
            print(f"[ERROR] Could not send email: {e}")

    end_time = time.time()
    elapsed_sec = end_time - start_time
    logging.info(f"Finished full_crime_report in {elapsed_sec:.2f} seconds.")
    print(f"Finished full_crime_report in {elapsed_sec:.2f} seconds.")

Explanation:

  • Pipeline Steps:

    1. Environment Variables: Loads necessary configurations like API keys and GitHub repository paths.

    2. Data Loading: Retrieves the complete set of crime data from summary_report.zip.

    3. Data Filtering: Selects records from the past seven days to focus the weekly report.

    4. CSV Generation: Exports the filtered data to a CSV file for easy access and dissemination.

    5. Map Creation:

      • Weekly Map: Generates an interactive map highlighting crimes from the past week.

      • Cumulative Map: Generates a comprehensive map showcasing all recorded crimes.

    6. GitHub Upload: Pushes the newly generated HTML maps and CSV reports to designated GitHub repository subfolders.

    7. Email Preparation and Sending:

      • Gmail API Authentication: Ensures secure access to the Gmail API for sending emails.

      • Recipient Loading: Retrieves the list of subscribers.

      • Email Composition: Crafts an email containing links to the latest maps and CSV. The attachments list is currently empty, so the CSV is shared by link rather than attached.

      • Email Dispatch: Sends the email to all subscribers.

    8. Logging: Records the entire process’s execution details, including any errors and execution time.

  • Testing Mode:

    • Flag: The test variable toggles between testing and production modes. With test = False (the value shown), the script uploads to GitHub and sends the email. Setting test = True skips both steps, which is useful for development and debugging without affecting live data or subscribers.

  • Error Handling:

    • Try-Except Blocks: Enclose critical operations to catch and log exceptions, ensuring the script fails gracefully and provides informative error messages.
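
Between the GitHub upload and the email, the workflow above pauses for a fixed 60 seconds so that GitHub Pages has time to publish. One possible alternative, which is not part of the original script, is to poll until the page responds. In this sketch wait_until_live and its parameters are hypothetical, and the check itself is injected as a function. In a real run that function might issue an HTTP request against the map URL, but that is an assumption. The demo uses a fake checker so it runs offline.

```python
import time


def wait_until_live(url, is_live, timeout=120.0, interval=5.0, sleep=time.sleep):
    """
    Poll is_live(url) until it returns True or roughly `timeout` seconds pass.
    Returns True if the page became available, False otherwise.
    """
    waited = 0.0
    while True:
        if is_live(url):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Demo with a fake checker that succeeds on the third attempt
attempts = {"n": 0}


def fake_is_live(url):
    attempts["n"] += 1
    return attempts["n"] >= 3


ok = wait_until_live(
    "https://example.com/OP-Crime-Maps/crime_map_weekly.html",  # placeholder URL
    fake_is_live,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(ok, attempts["n"])  # True 3
```

Compared with a fixed sleep, polling returns as soon as the page is up and reports a clear failure when it never appears.
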
g. Main Entry Point
Code
def main():
    """
    Simply runs the report generation from the command line; no Streamlit involved.
    """
    main_report_generation()

if __name__ == "__main__":
    main()

Explanation:

  • Purpose: Defines the script’s entry point, initiating the entire report generation process when the script is executed.

  • Function Call: Invokes main_report_generation() to start the pipeline.

Detailed Code Breakdown

Let’s delve deeper into specific functions and sections of weekly_crime_report.py to understand their roles and implementations.

a. Mailchimp Subscribers Fetching

Code
def get_mailchimp_subscribers():
    """
    Fetches all subscribed members from the Mailchimp audience.
    Handles pagination to retrieve all subscribers.
    """
    MAILCHIMP_API_KEY = os.getenv("MAILCHIMP_API_KEY")
    MAILCHIMP_AUDIENCE_ID = os.getenv("MAILCHIMP_AUDIENCE_ID")
    MAILCHIMP_DATA_CENTER = os.getenv("MAILCHIMP_DATA_CENTER")  # e.g., 'us1', 'us2'

    if not all([MAILCHIMP_API_KEY, MAILCHIMP_AUDIENCE_ID, MAILCHIMP_DATA_CENTER]):
        raise ValueError("Missing Mailchimp API configuration in environment variables.")

    MAILCHIMP_API_URL = f"https://{MAILCHIMP_DATA_CENTER}.api.mailchimp.com/3.0"
    endpoint = f"/lists/{MAILCHIMP_AUDIENCE_ID}/members"
    params = {
        "status": "subscribed",
        "count": 1000,  # Max allowed by Mailchimp
        "offset": 0
    }
    subscribers = []

    while True:
        response = requests.get(
            MAILCHIMP_API_URL + endpoint,
            auth=("anystring", MAILCHIMP_API_KEY),
            params=params
        )

        if response.status_code != 200:
            logging.error(f"Failed to fetch subscribers: {response.status_code} - {response.text}")
            raise Exception(f"Mailchimp API Error: {response.status_code} - {response.text}")

        data = response.json()
        members = data.get('members', [])
        subscribers.extend([member['email_address'] for member in members])

        total_items = data.get('total_items', 0)
        if len(subscribers) >= total_items:
            break
        params['offset'] += params['count']
        time.sleep(1)  # To respect API rate limits

    logging.info(f"Fetched {len(subscribers)} subscribers from Mailchimp.")
    return subscribers

Explanation:

  • Purpose: Retrieves all email subscribers from the specified Mailchimp audience list.

  • Process:

    1. API Configuration: Fetches Mailchimp API credentials from environment variables.

    2. Pagination Handling: Mailchimp’s API returns a maximum of 1,000 subscribers per request. This function loops through pages until all subscribers are retrieved.

    3. API Requests: Sends GET requests to the Mailchimp API, authenticating with the API key.

    4. Data Extraction: Extracts email addresses from the response and appends them to the subscribers list.

    5. Rate Limiting: Introduces a 1-second pause between requests to comply with Mailchimp’s rate limits.

  • Error Handling: Logs and raises exceptions for failed API requests.

Usage: This function ensures that the email dissemination process targets all current subscribers, keeping them updated with the latest crime reports.
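
The offset-based pagination loop can be exercised without calling Mailchimp. The following is a minimal sketch under stated assumptions: fetch_all_emails and fake_fetch_page are hypothetical names, and the fake page function only imitates the members and total_items keys that the function above reads. The sketch also stops on an empty page, a safeguard the original loop does not include.

```python
def fetch_all_emails(fetch_page, page_size=1000):
    """
    Collect every email address by walking offset-based pages.
    fetch_page(offset, count) must return a dict with 'members' and 'total_items'.
    """
    emails = []
    offset = 0
    while True:
        data = fetch_page(offset, page_size)
        members = data.get("members", [])
        emails.extend(m["email_address"] for m in members)
        # Stop when everything is collected, or when a page comes back empty
        if len(emails) >= data.get("total_items", 0) or not members:
            break
        offset += page_size
    return emails


# Fake audience of 5 subscribers, served 2 at a time
AUDIENCE = [{"email_address": f"user{i}@example.com"} for i in range(5)]


def fake_fetch_page(offset, count):
    return {"members": AUDIENCE[offset:offset + count], "total_items": len(AUDIENCE)}


result = fetch_all_emails(fake_fetch_page, page_size=2)
print(len(result))  # 5
```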

b. Date Range Determination

Code
def determine_date_range(df, execution_date):
    """
    Determines the date range for the report based on the execution_date (7 days).
    """
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df_valid = df[df['Date'].notna()].copy()
    if df_valid.empty:
        raise ValueError("No valid dates found in the data.")

    latest_date = df_valid[df_valid['Date'].dt.date <= execution_date]['Date'].max().date()
    if pd.isna(latest_date):
        raise ValueError("No records found on or before today's date.")

    # last 7 days
    end_date = latest_date
    start_date = end_date - timedelta(days=6)

    return start_date, end_date

Explanation:

  • Purpose: Establishes the start and end dates for the weekly report, anchored to the most recent record dated on or before the execution date.

  • Logic:

    1. Date Conversion: Converts the ‘Date’ column to datetime, coercing unparseable values to NaT and excluding those rows.

    2. Latest Date Identification: Finds the most recent date in the dataset that is on or before the execution date.

    3. Date Range Calculation: Sets the end date as the latest date and the start date as six days prior, encompassing a full week.

  • Error Handling: Raises exceptions if no valid dates are found or if no records exist up to the execution date.

Usage: This function ensures that the weekly report accurately reflects the most recent week’s data, maintaining the report’s relevance and timeliness.
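
A small worked example makes the anchoring visible. This sketch uses only the standard library rather than pandas, and seven_day_window is a hypothetical analogue of the function above, not the script's code. The sample dates are made up.

```python
from datetime import date, timedelta


def seven_day_window(record_dates, execution_date):
    """Return (start_date, end_date) ending at the latest record on or before execution_date."""
    eligible = [d for d in record_dates if d <= execution_date]
    if not eligible:
        raise ValueError("No records found on or before the execution date.")
    end_date = max(eligible)
    start_date = end_date - timedelta(days=6)  # 6 days back plus the end day = 7 days
    return start_date, end_date


dates = [date(2025, 1, 2), date(2025, 1, 5), date(2025, 1, 20)]
start, end = seven_day_window(dates, execution_date=date(2025, 1, 9))
print(start, end)  # 2024-12-30 2025-01-05
```

The window ends on January 5, the latest record before the run date, not on the run date itself. The record dated January 20 is ignored because it lies in the future relative to the execution date.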

c. Crime Data Filtering

Code
def filter_crime_data(df, start_date, end_date):
    """
    Filters the DataFrame to entries between start_date and end_date, inclusive.
    """
    start_dt = pd.to_datetime(start_date)
    end_dt   = pd.to_datetime(end_date) + timedelta(days=1) - timedelta(seconds=1)
    mask = (df['Date'] >= start_dt) & (df['Date'] <= end_dt)
    return df.loc[mask].copy()

Explanation:

  • Purpose: Selects crime records that fall within the specified date range.

  • Process:

    • Date Conversion: Converts start_date and end_date to datetime objects.

    • Mask Creation: Creates a boolean mask to filter records where the ‘Date’ falls within the range.

    • Data Selection: Applies the mask to the DataFrame and returns a copy of the filtered data.

Usage: This function isolates the relevant crime data for the weekly report, ensuring that only pertinent records are included in the generated maps and CSV reports.
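
The end-of-day adjustment is what makes the upper bound inclusive. The sketch below reproduces the same arithmetic with the standard library. in_window is a hypothetical helper for illustration, and the timestamps are made up.

```python
from datetime import date, datetime, time, timedelta


def in_window(timestamp, start_date, end_date):
    """True if timestamp falls between start_date 00:00:00 and end_date 23:59:59."""
    start_dt = datetime.combine(start_date, time.min)
    # Midnight of end_date, plus one day, minus one second = 23:59:59 on end_date
    end_dt = datetime.combine(end_date, time.min) + timedelta(days=1) - timedelta(seconds=1)
    return start_dt <= timestamp <= end_dt


start, end = date(2025, 1, 1), date(2025, 1, 7)
print(in_window(datetime(2025, 1, 7, 23, 59, 59), start, end))  # True
print(in_window(datetime(2025, 1, 8, 0, 0, 0), start, end))     # False
print(in_window(datetime(2025, 1, 1, 0, 0, 0), start, end))     # True
```

Without the adjustment, a record stamped late on the final day would compare as later than midnight of end_date and be dropped.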

d. Crime Data Loading

Code
def load_all_crimes(zip_file_path):
    """
    Reads summary_report.zip containing your full crime data (summary_report.csv).
    """
    if not zip_file_path.exists():
        raise FileNotFoundError(f"Could not find zip file '{zip_file_path}'")

    with zipfile.ZipFile(zip_file_path, 'r') as z:
        with z.open('summary_report.csv') as csvfile:
            df = pd.read_csv(csvfile, encoding='cp1252', on_bad_lines='skip')

    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df = df[df['Date'].notna()]
    df.sort_values(by='Date', ascending=False, inplace=True)
    return df

Explanation:

  • Purpose: Loads the complete set of crime data from a compressed ZIP file containing summary_report.csv.

  • Process:

    1. ZIP Extraction: Opens the ZIP file and extracts the CSV file.

    2. CSV Reading: Reads the CSV data into a Pandas DataFrame using cp1252 encoding, skipping malformed lines (on_bad_lines='skip') so that a few bad rows do not abort the load.

    3. Date Validation: Converts the ‘Date’ column to datetime format and removes records with invalid dates.

    4. Data Sorting: Sorts the DataFrame in descending order based on the ‘Date’ to prioritize recent records.

  • Error Handling: Raises a FileNotFoundError if the ZIP file does not exist.

Usage: This function provides a reliable method to access the entire crime dataset, forming the basis for generating both weekly and cumulative reports.
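
The same load, validate, and sort sequence can be tried on a throwaway archive. This sketch is a standard-library stand-in for the pandas version above: it builds a tiny ZIP in memory, reads the CSV member without extracting it to disk, drops rows with invalid dates, and sorts newest first. The sample rows and the parse_date helper are made up for illustration.

```python
import csv
import io
import zipfile
from datetime import datetime

# Build a small in-memory ZIP that mimics summary_report.zip (sample rows are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr(
        "summary_report.csv",
        "Date,Offense\n2025-01-05,THEFT\nnot-a-date,BURGLARY\n2025-01-07,BATTERY\n".encode("cp1252"),
    )
buffer.seek(0)

# Read the CSV member straight from the archive, without extracting it to disk
with zipfile.ZipFile(buffer, "r") as z:
    with z.open("summary_report.csv") as member:
        text = member.read().decode("cp1252")
rows = list(csv.DictReader(io.StringIO(text)))


def parse_date(value):
    """Return a datetime, or None when the value is not a YYYY-MM-DD date."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return None


# Drop rows with invalid dates, then sort newest first
valid = [r for r in rows if parse_date(r["Date"]) is not None]
valid.sort(key=lambda r: parse_date(r["Date"]), reverse=True)

print(len(rows), len(valid))  # 3 2
print(valid[0]["Offense"])    # BATTERY
```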

e. Safe Field Handling

Code
def safe_field(value):
    """
    Return the string version of a field or 'Not found' if it's missing/NaN/empty.
    """
    if pd.isnull(value) or value == "":
        return "Not found"
    return str(value)

Explanation:

  • Purpose: Ensures that all fields included in the reports are present and readable. If a field is missing (NaN) or empty, it returns “Not found” to maintain consistency in the reports.
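
A few sample inputs show the behavior. To keep the example free of dependencies, the sketch below uses safe_field_plain, a hypothetical standard-library analogue that covers None, float NaN, and empty strings. The original relies on pd.isnull, which recognizes additional missing-value types.

```python
import math


def safe_field_plain(value):
    """Dependency-free analogue of safe_field for None, float NaN, and empty strings."""
    if value is None or value == "":
        return "Not found"
    if isinstance(value, float) and math.isnan(value):
        return "Not found"
    return str(value)


print(safe_field_plain("THEFT"))       # THEFT
print(safe_field_plain(""))            # Not found
print(safe_field_plain(float("nan")))  # Not found
print(safe_field_plain(0))             # 0
```

The last call shows that a legitimate falsy value such as 0 is kept rather than treated as missing.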

f. Cumulative Map Creation

Code
def create_folium_map_cumulative(
    df,
    lat_col='Lat',
    lng_col='Long',
    offense_col='Offense',
    date_col='Date',
    output_html_path='cumulative_map.html'
):
    """
    Creates a Folium map plotting all records in df up to a certain date with disclaimers overlay (JS + HTML) 
    and saves it to 'output_html_path'.
    """
    oak_park_center = [41.885, -87.78]
    crime_map = folium.Map(location=oak_park_center, zoom_start=11)  # Zoomed out for cumulative view

    marker_cluster = MarkerCluster().add_to(crime_map)
    # Define the static part of the base URL
    base_url_static = 'https://www.oak-park.us/sites/default/files/police/summaries/'

    for _, row in df.iterrows():
        lat = row[lat_col]
        lng = row[lng_col]
        offense = row.get(offense_col, "Unknown")
        complaint = safe_field(row.get('Complaint #'))
        offense_val = safe_field(offense)
        date_str = safe_field(row['Date'].strftime('%Y-%m-%d') if pd.notnull(row['Date']) else np.nan)
        time_val = safe_field(row.get('Time'))
        location = safe_field(row.get('Location'))
        victim = safe_field(row.get('Victim/Address'))
        narrative = safe_field(row.get('Narrative'))
        filename = safe_field(row.get('File Name'))

        popup_html = f"""
            <b>Complaint #:</b> {complaint}<br/>
            <b>Offense:</b> {offense_val}<br/>
            <b>Date:</b> {date_str}<br/>
            <details>
              <summary><b>View Details</b></summary>
              <b>Time:</b> {time_val}<br/>
              <b>Location:</b> {location}<br/>
              <b>Victim:</b> {victim}<br/>
              <b>Narrative:</b> {narrative}<br/>
              <b>URL:</b> <a href="{filename}" target="_blank">PDF Link</a>
            </details>
        """

        folium.Marker(
            location=[lat, lng],
            popup=folium.Popup(popup_html, max_width=300),
            icon=folium.Icon(color='blue', icon='info-sign')  # Different color for distinction
        ).add_to(marker_cluster)

    # Basic map title
    title_html = '''
    <h3 align="center" style="font-size:20px"><b>Oak Park Cumulative Crime Map</b></h3>
    <br>
    <h3 align="center" style="font-size:10px">
    <a href="https://jesse-anderson.net/">My Portfolio</a> |
    <a href="https://blog.jesse-anderson.net/">My Blog</a> |
    <a href="https://blog.jesse-anderson.net/">Documentation</a> |
    <a href="mailto:jesse@jesse-anderson.net">Contact</a> |
    <a href="https://forms.gle/GnyaVwo1Vzm8nBH6A">
        Add me to Weekly Updates
    </a>
    </h3>
'''
    crime_map.get_root().html.add_child(folium.Element(title_html))

    # Overlays disclaimers in a "splash screen" with JavaScript (reuse existing disclaimer)
    disclaimers_overlay = """
    <style>
    /* Full-page overlay styling */
    #disclaimerOverlay {
      position: fixed;
      z-index: 9999; /* On top of everything */
      left: 0;
      top: 0;
      width: 100%;
      height: 100%;
      background-color: rgba(255, 255, 255, 0.95);
      color: #333;
      display: block; /* Visible by default */
      overflow: auto;
      text-align: center;
      padding-top: 100px;
      font-family: Arial, sans-serif;
    }
    #disclaimerContent {
      background: #f9f9f9;
      border: 1px solid #ccc;
      display: inline-block;
      padding: 20px;
      max-width: 800px;
      text-align: left;
    }
    #acceptButton {
      margin-top: 20px;
      padding: 10px 20px;
      font-size: 16px;
      cursor: pointer;
    }
    </style>

    <div id="disclaimerOverlay">
      <div id="disclaimerContent">
        <h2>Important Legal Disclaimer</h2>
        <p><strong>By using this demonstrative research tool, you acknowledge and agree:</strong></p>
        <ul>
            <li>This tool is for <strong>demonstration purposes only</strong>.</li>
            <li>The data originated from publicly available Oak Park Police Department PDF files.
                View the official site here: 
                <a href="https://www.oak-park.us/village-services/police-department"
                   target="_blank">Oak Park Police Department</a>.</li>
            <li>During parsing, <strong>~10%</strong> of complaints were <strong>omitted</strong> 
                due to parsing issues; thus the data is <strong>incomplete</strong>.</li>
            <li>The <strong>official</strong> and <strong>complete</strong> PDF files remain 
                with the Oak Park Police Department.</li>
            <li>You <strong>will not hold</strong> the author <strong>liable</strong> for <strong>any</strong> 
                decisions—formal or informal—based on this tool.</li>
            <li>This tool <strong>should not</strong> be used in <strong>any</strong> official or unofficial 
                <strong>decision-making</strong>.</li>
        </ul>
        <p><strong>By continuing, you indicate your acceptance of these terms 
           and disclaim all liability.</strong></p>
        <hr/>
        <button id="acceptButton" onclick="hideOverlay()">I Accept</button>
      </div>
    </div>

    <script>
    function hideOverlay() {
      var overlay = document.getElementById('disclaimerOverlay');
      overlay.style.display = 'none'; 
    }
    </script>
    """

    disclaimers_element = folium.Element(disclaimers_overlay)
    crime_map.get_root().html.add_child(disclaimers_element)

    # Save final HTML
    crime_map.save(str(output_html_path))

Explanation:

  • Map Initialization:

    • Centering: Centers the Folium map on Oak Park’s geographical coordinates with a zoom level of 11 for a broader view in the cumulative map.
  • Marker Creation:

    • Customization: Each crime incident is represented by a blue marker with an info-sign icon for distinction from the weekly map’s red markers.

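    • Popups: Each popup shows the complaint number, offense, and date up front, with the time, location, victim, narrative, and a link to the source PDF tucked inside a collapsible details element.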

  • Title Addition:

    • HTML Styling: Adds a title and navigation links directly onto the Folium map using HTML and inline CSS.
  • Disclaimer Overlay:

    • Reuse: Reuses the same disclaimer overlay mechanism as in the weekly map to ensure consistency.
  • Map Saving:

    • Output: Saves the generated cumulative map with overlays to the specified HTML file path.
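
Because the popup is assembled by interpolating field values directly into HTML, free-text fields such as the narrative could contain characters like < or & that disturb the markup. One possible hardening step, which the original function does not perform, is to escape each value first. In this sketch build_popup_html is a hypothetical helper, the popup is shortened to three fields, and the sample values are made up.

```python
from html import escape


def build_popup_html(complaint, offense, narrative):
    """Build popup markup with the field values escaped for HTML."""
    return (
        f"<b>Complaint #:</b> {escape(complaint)}<br/>"
        f"<b>Offense:</b> {escape(offense)}<br/>"
        "<details>"
        "<summary><b>View Details</b></summary>"
        f"<b>Narrative:</b> {escape(narrative)}<br/>"
        "</details>"
    )


html_snippet = build_popup_html("25-00001", "THEFT", "Victim reported <unknown> & left")
print(html_snippet)
```

The tags that structure the popup stay intact, while the narrative's angle brackets and ampersand are rendered as literal text.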

g. Gmail API Service Setup

Code
def get_gmail_service():
    """
    Authenticates the user and returns the Gmail API service.
    """
    creds = None
    token_path = Path('token.json')

    if token_path.exists():
        creds = Credentials.from_authorized_user_file(str(token_path), SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            try:
                creds.refresh(Request())
                logging.info("Credentials refreshed successfully.")
            except Exception as e:
                logging.error(f"Error refreshing credentials: {e}")
                print(f"[ERROR] Could not refresh credentials: {e}")
                return None
        else:
            try:
                flow = InstalledAppFlow.from_client_secrets_file(
                    'credentials.json', SCOPES
                )
                creds = flow.run_local_server(port=0)
                logging.info("Authentication flow completed successfully.")
            except Exception as e:
                logging.error(f"Error during OAuth flow: {e}")
                print(f"[ERROR] Could not complete OAuth flow: {e}")
                return None

        try:
            with open(token_path, 'w') as token:
                token.write(creds.to_json())
                logging.info("Credentials saved to token.json.")
        except Exception as e:
            logging.error(f"Failed to save credentials: {e}")
            print(f"[ERROR] Could not save credentials: {e}")

    try:
        service = build('gmail', 'v1', credentials=creds)
        logging.info("Gmail service created successfully.")
        return service
    except HttpError as error:
        logging.error(f"An error occurred while building Gmail service: {error}")
        print(f"[ERROR] An error occurred while building Gmail service: {error}")
        return None

Explanation:

  • Purpose: Authenticates the user and establishes a connection with the Gmail API to send emails.

  • Authentication Flow:

    1. Token Check: Looks for existing credentials in token.json.

    2. Token Refresh: If credentials are expired but have a refresh token, it attempts to refresh them.

    3. OAuth Flow: If no valid credentials exist, initiates the OAuth flow using credentials.json.

    4. Credential Saving: Saves the new or refreshed credentials back to token.json for future use.

  • Error Handling: Logs and prints errors encountered during authentication or service creation.

Usage: This function ensures secure and authenticated access to the Gmail API, enabling the script to send automated emails containing the latest crime reports.

h. Email Sending Function

Code
def send_email_with_disclaimer_and_links(
    service,
    sender_email,
    to_emails,
    subject,
    body_text,
    attachments
):
    """
    Sends an email with disclaimers and links using the Gmail API.
    """
    try:
        message = MIMEMultipart()
        message['to'] = "Undisclosed Recipients <jesse@jesse-anderson.net>"
        message['subject'] = subject
        message['from'] = sender_email

        # Define the plain text content
        plain_text = f"""
        Important Legal Disclaimer
        
        By using this demonstrative research tool, you acknowledge and agree:

        - This tool is for demonstration purposes only.
        - The data originated from publicly available Oak Park Police Department PDF files.
          View the official site here: https://www.oak-park.us/village-services/police-department.
        - During parsing, ~10% of total complaints since 2018 were omitted due to parsing issues; 
          thus the data is incomplete.
        - The official and complete PDF files remain with the Oak Park Police Department.
        - You will not hold the author liable for any decisions—formal or informal—based on this tool.
        - This tool should not be used in any official or unofficial decision-making.

        By continuing, you indicate your acceptance of these terms and disclaim all liability.

        ------------

        Hello,
        The crime report from {body_text['start_date']} to {body_text['end_date']} is attached as a .csv file.

        Interactive map:
        {body_text['weekly_map_url']}

        Cumulative map:
        {body_text['cumulative_map_url']}
        
        Last week's data:
        {body_text['csv_url']}
        """

        part1 = MIMEText(plain_text, 'plain')
        message.attach(part1)

        bcc_emails = ", ".join(to_emails)
        message['bcc'] = bcc_emails

        # Attach the CSV
        for file_path in attachments:
            file_path = Path(file_path)
            if not file_path.exists():
                logging.warning(f"Attachment '{file_path}' not found, skipping.")
                continue
            with open(file_path, 'rb') as f:
                mime_application = MIMEApplication(f.read(), Name=file_path.name)
            mime_application['Content-Disposition'] = f'attachment; filename="{file_path.name}"'
            message.attach(mime_application)

        raw_message = base64.urlsafe_b64encode(message.as_bytes()).decode()
        body = {'raw': raw_message}

        sent_message = service.users().messages().send(userId="me", body=body).execute()
        logging.info(f"Email sent successfully. Message ID: {sent_message['id']}")
        print("Crime report email sent successfully.")
    except Exception as e:
        logging.error(f"Failed to send email: {e}")
        print(f"[ERROR] Failed to send email: {e}")

Explanation:

  • Purpose: Composes and sends an email containing the latest crime reports and links to interactive maps.

  • Structure:

    • Email Composition:

      • Plain Text: Provides a disclaimer and details about the latest crime reports.

      • Attachments: Attaches each file listed in attachments that exists on disk; missing files are logged and skipped.

    • BCC: Sends the email to undisclosed recipients to maintain privacy.

  • Encoding: Encodes the full MIME message as URL-safe base64 and places it in the raw field, as the Gmail API requires.

  • Sending: Utilizes the authenticated Gmail service to send the email.

  • Error Handling: Logs and prints errors encountered during the email sending process.
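
To see what actually gets handed to the Gmail API, the composition and encoding steps can be reproduced without any network access. The sketch below uses only the standard library; build_raw_message, the example.com addresses, and the sample CSV bytes are placeholders I made up for illustration. It builds the message, encodes it, then decodes it again to confirm that the BCC header and the attachment survive the round trip.

```python
import base64
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_raw_message(sender, bcc, subject, text, csv_bytes, csv_name):
    """Build a MIME message and wrap it the way the Gmail API expects."""
    message = MIMEMultipart()
    message["from"] = sender
    message["to"] = f"Undisclosed Recipients <{sender}>"
    message["bcc"] = ", ".join(bcc)  # real recipients stay hidden from each other
    message["subject"] = subject
    message.attach(MIMEText(text, "plain"))

    # Attach the CSV bytes under the given file name.
    part = MIMEApplication(csv_bytes, Name=csv_name)
    part["Content-Disposition"] = f'attachment; filename="{csv_name}"'
    message.attach(part)

    # URL-safe base64 of the full message goes in the "raw" field.
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode()
    return {"raw": raw}


body = build_raw_message(
    sender="sender@example.com",                  # placeholder address
    bcc=["a@example.com", "b@example.com"],       # placeholder recipients
    subject="Crime Report",
    text="Hello",
    csv_bytes=b"Date,Offense\n2025-01-01,THEFT\n",  # made-up sample row
    csv_name="filtered_subset.csv",
)

# Decode again to inspect what would be sent.
decoded = message_from_bytes(base64.urlsafe_b64decode(body["raw"]))
print(decoded["bcc"])
print([p.get_filename() for p in decoded.walk() if p.get_filename()])
```

Decoding the payload locally like this is a cheap way to check headers and attachments before pointing the script at real subscribers.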

i. Main Report Generation Workflow

Code
def main_report_generation():
    """
    Executes the full pipeline:
    1. Load environment variables
    2. Load & filter data for the last 7 days
    3. Create & save Folium map with disclaimers overlay
    4. Upload HTML to GitHub
    5. Send email with CSV attached
    """
    start_time = time.time()
    script_dir = Path(__file__).parent.resolve()

    env_file_path = script_dir / "env_vars.txt"
    try:
        load_env_vars(env_file_path)
    except FileNotFoundError as e:
        print(f"[ERROR] {e}")
        return

    sender_email = os.getenv("SENDER_EMAIL")
    if not sender_email:
        raise ValueError("Missing SENDER_EMAIL in env_vars.txt")

    logging.basicConfig(
        filename=script_dir / 'full_crime_report.log',
        level=logging.DEBUG,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

    data_dir = script_dir / 'data'
    map_dir = script_dir / 'generated_maps'
    # recipients_csv = script_dir / 'recipients.csv'
    csv_dir = script_dir / 'generated_csvs'  # New directory for CSVs
    map_dir.mkdir(parents=True, exist_ok=True)
    csv_dir.mkdir(parents=True, exist_ok=True)

    zip_file_path = data_dir / 'summary_report.zip'

    execution_date = datetime.now().date()
    date_str = execution_date.strftime('%Y-%m-%d')

    # CSV & HTML output
    filtered_subset_filename = f'filtered_subset_{date_str}.csv'
    filtered_subset_path = csv_dir / filtered_subset_filename
    weekly_map_output_filename = f'crime_map_weekly_{date_str}.html'
    weekly_map_output_path = map_dir / weekly_map_output_filename
    cumulative_map_output_filename = 'crime_map_cumulative.html'  # fixed name, no f-string needed
    cumulative_map_output_path = map_dir / cumulative_map_output_filename

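    # Local GitHub Pages folder
    github_repo_path = os.getenv("GITHUB_REPO")
    if not github_repo_path:
        # Path(None) would raise a TypeError, so fail early with a clear message
        raise ValueError("Missing GITHUB_REPO in env_vars.txt")
    github_repo_path = Path(github_repo_path)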

    # (C) Load data
    try:
        df_full = load_all_crimes(zip_file_path)
    except Exception as e:
        logging.error(f"Failed to load all crimes: {e}")
        print(f"[ERROR] Could not load all crimes: {e}")
        return

    if df_full.empty:
        logging.info("No crime data—no map or CSV generated.")
        print("No crime data—no map or CSV generated.")
        return

    # (D) Filter last 7 days
    try:
        start_date, end_date = determine_date_range(df_full, execution_date)
        df_filtered = filter_crime_data(df_full, start_date, end_date)
    except Exception as e:
        logging.error(f"Error determining date range: {e}")
        print(f"[ERROR] Could not determine date range: {e}")
        return

    if df_filtered.empty:
        logging.info("No crimes found—no map or CSV generated.")
        print("No crimes found in the determined date range—no map or CSV generated.")
        return

    # (E) Write filtered CSV
    try:
        df_filtered.to_csv(filtered_subset_path, index=False, encoding="cp1252")
        logging.info(f"Filtered data written to {filtered_subset_path}")
    except Exception as e:
        logging.error(f"Failed to write filtered data to CSV: {e}")
        print(f"[ERROR] Could not write filtered data to CSV: {e}")
        return

    # (F) Create Folium map with disclaimers overlay
    try:
        create_folium_map_filtered_data(
            df=df_filtered,
            lat_col='Lat',
            lng_col='Long',
            offense_col='Offense',
            date_col='Date',
            output_html_path=weekly_map_output_path
        )
        logging.info(f"Folium map created at {weekly_map_output_path}")
    except Exception as e:
        logging.error(f"Error creating Folium map: {e}")
        print(f"[ERROR] Could not create Folium map: {e}")
        return

    try:
        create_folium_map_cumulative(
            df=df_full,  # Use the full dataset for cumulative map
            lat_col='Lat',
            lng_col='Long',
            offense_col='Offense',
            date_col='Date',
            output_html_path=cumulative_map_output_path
        )
        logging.info(f"Cumulative Folium map created at {cumulative_map_output_path}")
    except Exception as e:
        logging.error(f"Error creating cumulative Folium map: {e}")
        print(f"[ERROR] Could not create cumulative Folium map: {e}")
        return
    
    test = False
    if not test:
        # (G) Upload Map and csv
        try:
            # Upload Maps
            files_to_upload_maps = [weekly_map_output_path, cumulative_map_output_path]
            upload_files_to_github_batch(
                file_paths=files_to_upload_maps,
                github_repo_path=github_repo_path,
                target_subfolder='OP-Crime-Maps'
            )
            time.sleep(5) #paranoia
            # Upload CSV
            files_to_upload_csv = filtered_subset_path
            upload_file_to_github(
                file_path=files_to_upload_csv,
                github_repo_path=github_repo_path,
                target_subfolder='OP-Crime-Data'
            )
        except Exception as e:
            logging.error(f"Failed to upload files to GitHub: {e}")
            print(f"[ERROR] Could not upload files to GitHub: {e}")
            return
        # (I) Generate GitHub URLs for the uploaded files
        # Assuming GitHub Pages are served from the root of the repository
        github_base_url = "https://jesse-anderson.github.io"

        weekly_map_url = f"{github_base_url}/OP-Crime-Maps/{weekly_map_output_filename}"
        cumulative_map_url = f"{github_base_url}/OP-Crime-Maps/{cumulative_map_output_filename}"
        csv_url = f"{github_base_url}/OP-Crime-Data/{filtered_subset_filename}"

        # (J) Gmail API & Email
        try:
            service = get_gmail_service()
            if not service:
                raise Exception("Failed to create Gmail service.")
        except Exception as e:
            logging.error(f"Authentication failed: {e}")
            print(f"[ERROR] Authentication failed: {e}")
            return
        time.sleep(60) #time to build github pages....
        # (K) Load Recipients
        try:
            # to_list = load_recipients_list(recipients_csv)
            # to_list = get_mailchimp_subscribers()
            to_list = ["myemail@gmail.com"]
        except FileNotFoundError as e:
            logging.error(f"Error loading recipients: {e}")
            print(f"[ERROR] {e}")
            return

        if not to_list:
            logging.warning("No recipients found—cannot send email.")
            print("No recipients found in recipients.csv—cannot send email.")
            return

        subject = f"Crime Report from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
        body_text = {
            'start_date': start_date.strftime('%Y-%m-%d'),
            'end_date': end_date.strftime('%Y-%m-%d'),
            'weekly_map_url': weekly_map_url,
            'cumulative_map_url': cumulative_map_url,
            'csv_url': csv_url
        }
        attachments = []

        try:
            send_email_with_disclaimer_and_links(
                service=service,
                sender_email=sender_email,
                to_emails=to_list,
                subject=subject,
                body_text=body_text,
                attachments=attachments
            )
        except Exception as e:
            logging.error(f"Failed to send email: {e}")
            print(f"[ERROR] Could not send email: {e}")

    end_time = time.time()
    elapsed_sec = end_time - start_time
    logging.info(f"Finished full_crime_report in {elapsed_sec:.2f} seconds.")
    print(f"Finished full_crime_report in {elapsed_sec:.2f} seconds.")

Explanation:

  • Pipeline Steps:

    1. Environment Variables: Loads necessary configurations like API keys and GitHub repository paths.

    2. Data Loading: Retrieves the complete set of crime data from summary_report.zip.

    3. Data Filtering: Selects records from the past seven days to focus the weekly report.

    4. CSV Generation: Exports the filtered data to a CSV file for easy access and dissemination.

    5. Map Creation:

      • Weekly Map: Generates an interactive map highlighting crimes from the past week.

      • Cumulative Map: Generates a comprehensive map showcasing all recorded crimes.

    6. GitHub Upload: Pushes the newly generated HTML maps and CSV reports to designated GitHub repository subfolders.

    7. Email Preparation and Sending:

      • Gmail API Authentication: Ensures secure access to the Gmail API for sending emails.

      • Recipient Loading: Retrieves the list of subscribers.

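      • Email Composition: Crafts an email containing links to the latest reports. Note that attachments is an empty list in the code above, so the CSV is delivered through its link rather than as an attachment.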

      • Email Dispatch: Sends the email to all subscribers.

    8. Logging: Records the entire process’s execution details, including any errors and execution time.

  • Testing Mode:

    • Flag: The test variable toggles between testing and production modes. When test = False (the value set in the code above), the script proceeds with uploading and emailing. Setting test = True skips both steps, while the maps and CSV are still generated locally. This is useful for development and debugging without affecting live data or subscribers.

  • Error Handling:

    • Try-Except Blocks: Enclose critical operations to catch and log exceptions, ensuring the script fails gracefully and provides informative error messages.
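
The repeated try/except pattern in the workflow can be condensed into a small helper. The following is a minimal sketch of that idea, not how the script is actually written: run_step, load_rows, write_csv, and pipeline are hypothetical names, and the failing step is simulated.

```python
import logging
from typing import Any, Callable, Tuple


def run_step(name: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[bool, Any]:
    """Run one pipeline step; log and report a failure instead of raising."""
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        logging.error(f"{name} failed: {exc}")
        print(f"[ERROR] {name} failed: {exc}")
        return False, None
    logging.info(f"{name} succeeded.")
    return True, result


def load_rows():
    # Stand-in for loading crime data.
    return [{"Offense": "THEFT"}]


def write_csv(rows):
    # Simulated failure to show how the pipeline stops early.
    raise OSError("disk full")


def pipeline() -> str:
    ok, rows = run_step("Load data", load_rows)
    if not ok:
        return "stopped at load"
    ok, _ = run_step("Write CSV", write_csv, rows)
    if not ok:
        return "stopped at write"
    return "finished"


print(pipeline())
```

Each step either hands its result to the next one or stops the run with a logged error, which is the same behavior as the early return statements in main_report_generation.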

Conclusion of Static HTML Generation Section

The Static HTML Generation component plays a crucial role in ensuring that the Oak Park Crime Reporting project delivers timely and accessible information. By automating the creation of detailed reports and maps, integrating seamlessly with GitHub for hosting, and utilizing the Gmail API for dissemination, the script ensures that stakeholders and interested parties receive up-to-date crime data efficiently. Robust error handling and logging mechanisms further enhance the reliability and maintainability of the reporting pipeline.


Conclusion

The Oak Park Crime Reporting project is a comprehensive solution that automates the extraction, processing, visualization, and dissemination of crime data within Oak Park. By leveraging powerful Python libraries and integrations with platforms like GitHub and Gmail, the project ensures that crime data is both accessible and actionable. The meticulous documentation of each component—Data Parsing, Live Streamlit Dashboard, and Static HTML Generation—provides a clear roadmap for understanding, maintaining, and potentially expanding the project’s capabilities.


Future Enhancements

While the current implementation of the Oak Park Crime Reporting project is robust, there are several avenues for future enhancements:

  1. Enhanced Data Parsing:

    • Machine Learning Integration: Implement machine learning models to improve the accuracy of data extraction from PDFs, especially for complex narratives. Currently the narratives contain stray spaces that make them harder to read, but manually cleaning 7,000+ records or spending more time on parsing code does not seem worthwhile.

    • Automated Data Validation: Introduce automated checks to validate the integrity and consistency of the parsed data.

  2. Dashboard Improvements:

    • Advanced Visualization: Incorporate additional visualization tools like heatmaps, trend graphs, and statistical summaries.

    • User Authentication: Add authentication layers to restrict access to sensitive data and functionalities.

    • Real-Time Updates: Enable real-time data streaming to keep the dashboard updated without manual interventions.

  3. Static Report Enhancements:

    • Interactive Elements: Introduce interactive components, such as advanced filtering, within the static HTML reports for better user engagement.

    • Responsive Design: Ensure that the static reports are mobile-friendly and adapt seamlessly to various screen sizes.

  4. Email System Upgrades:

    • Personalization: Personalize emails based on user preferences and subscription tiers, such as daily, weekly, monthly, quarterly, and yearly.

    • Email Scheduling: Implement scheduled email dispatches that send reports automatically at predefined intervals, beyond simple cron jobs.

  5. Scalability and Performance:

    • Cloud Deployment: Migrate components to cloud platforms for better scalability, reliability, and performance. This is not currently implemented because cloud hosting costs money and the project does not generate any revenue.

    • Database Integration: Utilize databases like PostgreSQL or MongoDB for more efficient data storage and querying once the number of stored records exceeds 10,000.

  6. Security Enhancements:

    • Data Encryption: Encrypt sensitive data both at rest and in transit to ensure data privacy. While this is already done, it can always be improved upon.

    • Secure Authentication: Strengthen authentication mechanisms for API integrations and user access. This is already in place, but there is always room for improvement.

  7. User Feedback Mechanism:

    • Feedback Forms: Incorporate feedback forms within the dashboard and reports to gather user insights and suggestions.

    • Analytics: Implement analytics to monitor user interactions and improve the platform based on usage patterns.

By pursuing these enhancements, the Oak Park Crime Reporting project can evolve into an even more powerful tool, providing deeper insights and greater value to its users.


I hope you have enjoyed reading this documentation and that you got something out of it.

Jesse
