
User:TheWeeklyIslander

From Wikipedia, the free encyclopedia

Hello,

I'm TheWeeklyIslander. My name was inspired by The Colorado Kid by Stephen King, where there is a fictitious newspaper called The Weekly Islander. Contrary to the name, I have been on very few islands in my life, let alone weekly.

If you are on my page, it is likely due to an update I made to the demographic section of an American place (city, town, CDP, village, etc.). You can update too! Just follow the section below!

Open-Source Demographics Generating Tool


Hello,

We are coming up on four years of not having accurate demographics data for much of the United States, a country which has been rapidly diversifying. Demographics have a massive impact on the United States. The only thing worse than having no demographics data is having inaccurate demographics data, a plight I have seen in many places that do have demographics sections. They may also not be properly cited, which is something I have also sought to fix. This tool gathers data from the Census Bureau and exports it to text files that can be copied and pasted into Wikipedia through source editing.

Updating the demographics data for the entire United States is too much for one person, so I decided to build a graphical user interface that anyone can use. I made it compatible with Google Colab for accessibility and ease of use. Google Colab is free, the only requirement is a Google account, and you can access it at https://colab.research.google.com.
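If you are curious what the scripts actually do, at their core they issue requests like the one below against the Census Bureau API. This is only a minimal sketch, not part of the tool itself: the key is a placeholder, and the state/place FIPS codes are example values to be replaced with codes from the Gazetteer file.

import requests

# Minimal sketch of one Census API call: total population (P1_001N) from the
# 2020 decennial redistricting (PL 94-171) file for a single place.
# Replace YOUR_KEY and the FIPS codes with your own values.
query = (
    "https://api.census.gov/data/2020/dec/pl"
    "?get=NAME,P1_001N"
    "&for=place:20000&in=state:08"
    "&key=YOUR_KEY"
)
response = requests.get(query)
print(response.json())  # a header row followed by one data row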

To run the scripts, you will need a few inputs:

  1. A Census Bureau API Key. This is free and you can register for one on https://api.census.gov/data/key_signup.html. You do not need to be part of an organization.
  2. The Gazetteer files for 2020. Those can be found here: https://www.census.gov/geographies/reference-files/2020/geo/gazetter-file.html.
    • I only built my tool to handle County, County Subdivision, and Place data, so you may have to alter the scripts to handle other geographies such as tracts, congressional districts, etc.
  3. Convert the Gazetteer files to .csv format using Excel, because that is what I wrote my scripts to handle. Do not let Excel convert the values (such as GEOIDs) when doing this! A programmatic alternative to Excel is sketched just after this list.
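If you would rather not use Excel, a couple of lines of pandas will do the same conversion. This is only a sketch: it assumes the Gazetteer download is the usual tab-delimited .txt file, and the file names are placeholders. Reading every column as text is what prevents the value conversion (dropped leading zeros in GEOIDs, etc.) that Excel can cause.

import pandas as pd

# Read the tab-delimited Gazetteer file with every column as text so GEOID/FIPS
# codes keep their leading zeros, then write it back out as a .csv.
gazetteer = pd.read_csv("2020_Gaz_place_national.txt", sep="\t", dtype=str)
gazetteer.columns = gazetteer.columns.str.strip()  # some headers carry stray whitespace
gazetteer.to_csv("2020_Gaz_place_national.csv", index=False)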

Instructions:

  1. Acquire the inputs as stated up above.
  2. Create an account on Google Colab or use your personal Jupyter Notebook (I haven't tried the latter, but if you have a Jupyter Notebook set up, you can likely figure it out).
  3. Copy and paste Cell 1 into the first cell.
  4. Copy and paste Cell 2 into the second cell.
  5. Copy and paste Cell 3 into the third cell.
  6. Run Cell 1. At the bottom, just past the cell, a scrollable section will appear containing a text box, an upload button, another text box, and a series of checkboxes labelled by state.
    • The first text box is for the API Key you applied for.
    • The upload button is for the Gazetteer .csv file that you made.
    • The second text box is for the output directory. I recommend typing "/content".
    • The checkboxes are for the states you would like to generate the demographic information for. There is a convenient "Select All" button if you are ambitious, otherwise just choose the state(s) you like.
  7. Once all of the inputs from step 6 have been entered, hit the green "Generate Demographics" button at the end of the list of state checkboxes. This will create a JSON configuration file in the "/content" directory (its structure is shown just after this list).
  8. Run Cell 2. If you scroll to the bottom of Cell 2, you will see it print an estimated time range for completion and the name of the place currently being generated. It is parallelized as best I could, and I found that it would generate all areas in ~13 hours, or about 1 place every half second.
  9. Run Cell 3. This will create .zip folders for each of the states you selected in step 6 and download them to your computer. Note: you do not need to wait for step 8 to be completed before hitting the run button on Cell 3. This way you may leave it running while you do other things.
  10. Edit Wikipedia to your heart's content.
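For reference, the "Generate Demographics" button in Cell 1 simply writes your inputs to /content/demographics_config.json, which Cells 2 and 3 then read. Loaded in Python, it is a plain dictionary along these lines (the key, file name, and states shown are placeholders):

import json

with open("/content/demographics_config.json") as f:
    config = json.load(f)

# config looks something like:
# {
#     "census_id": "YOUR_CENSUS_API_KEY",
#     "gazette_file": "/content/2020_Gaz_place_national.csv",
#     "output_dir": "/content",
#     "selected_states": ["Rhode Island", "Vermont"]
# }
print(config["selected_states"])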

Limitations/Notes:

This is the hobby project of a tired grad student, not a programmer, so please be understanding if you don't find this tool to be perfect. I have noticed a couple of limitations, which I will list here:

  • I have commented extensively. One of the most important comments lists which Census Bureau tables and variables I used. This is in the pursuit of full transparency, so if there are any issues, I hope the community can find and fix them. I know some people have done similar Python exercises, but I haven't found their scripts posted. I manually checked a few places throughout the U.S. against the Census Bureau tables I used, but with nearly 66,000 areas across places, counties, and county subdivisions, there is just no way I can check everywhere.
  • The biggest issue I expect to hear about, and hopefully won't now that I'm addressing it, is that some places do not appear to be generated. I set a limit on the population size of places that the script will generate information for: places below 25 people are skipped over. This can be changed by lowering the threshold, for example from this:

if population<25:
        return

to this:

if population<1:
        return

or by removing the check from Cell 2 altogether. I also removed the ability to generate county subdivision places that have "District" or "Precinct" in their name. These choices were made to minimize the likelihood of throwing errors, and it looks like they have worked.

  • This script only works for data from Places, Counties, and County Subdivisions in the 50 states of the U.S.; it does not currently work for tracts, congressional districts, etc.
  • Because it only works on those geographies, it does not handle Washington, D.C. or any territories, such as Puerto Rico. The Census Bureau has this data somewhere, and you are welcome to modify these scripts to handle it.
  • If your version of Excel is too old (such as Excel 2010), it will not properly encode the data when converting the Gazetteer file to .csv.
  • You will find that certain cities are generated twice if you generate both places and county subdivisions. One text file is named with the city's name, the other with the city's name and county. These are the same data as far as I have checked, but you may verify for yourself. I kept this behavior rather than removing it because in New England and the mid-Atlantic a lot of townships, towns, etc. were being skipped over because of the way those places are recorded, such as South Kingstown, Rhode Island. When I realized those places were in the county subdivisions, I decided it would be better to have duplicates than exclusions, especially with the parallelization. This was one of the major reasons I did not release this sooner.
  • There were times when massively negative numbers would be returned from the census tables if there was no data for small-population areas - numbers like -$333,333,333, or all 2's or all 6's. I believe I fixed this, but don't be afraid to mention it if you see it. Please note that this only affects the ACS data and not the decennial census data, so you can simply cut the offending sentence out (a small sketch for spotting these placeholder values appears just after this list).
  • Do not trust that a demographics section is accurate just because a place has one. You would be surprised how often it is not.
  • This script does not generate the tables that people really like to put up instead of a demographics section, but all of the data necessary to build one is included in these scripts.
  • Inclusion of ACS data: I don't personally view this as a limitation, but I have noticed some people disagree about ACS data being included in my demographics sections. I am aware that the ACS is not the decennial census, as evidenced by poring over all of these data tables while compiling them here. The ACS data is used primarily in the last paragraph of the demographics sections, which details income and poverty data, as well as for the estimate of bachelor's degree holders in the second-to-last paragraph. If you do not like this data or believe it should not be included, either change the section to say "2020" or remove the text from the Python script. However, I believe this is important data that should be included; it was collected in the 2000 decennial census and shifted to the ACS when that survey was created in 2005, both to simplify the decennial census and, more importantly, to provide frequent, up-to-date economic information. I used the 5-year data because it is aggregated over a five-year period, which makes it more reliable for small areas.
  • This is the link that inspired me and gave me the basic background for using an API with Python to build this tool.[1] Thanks to Michael McManus for his guide!
  • If there are any errors, I will very intermittently work to fix them or discuss with members of the community. Apologies, but I have commitments that take precedence over editing Wikipedia.
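As a quick check for the placeholder values mentioned in the list above, something like the following can be run over a generated text file before pasting it into an article. It is only a sketch: the file name is a placeholder, and the codes listed are the same ones Cell 2 already checks for.

import re

# Census API codes that mean "no estimate available"; Cell 2 checks for these
# same values before writing the income and poverty sentences.
SENTINELS = {"-666666666", "-222222222", "-333333333"}

def has_census_sentinel(text):
    """Return True if the text still contains a placeholder value,
    with or without thousands separators or a trailing .0."""
    for match in re.findall(r"-[\d,]+(?:\.0)?", text):
        cleaned = match.replace(",", "").removesuffix(".0")
        if cleaned in SENTINELS:
            return True
    return False

with open("Example City_Demographics.txt") as f:  # placeholder file name
    if has_census_sentinel(f.read()):
        print("Warning: this file still contains a Census 'no data' placeholder.")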

Cell 1

"""
@author: TheWeeklyIslander
"""
import os
import json
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from google.colab import files


class StateSelectionApp:
    def __init__(self):
        self.states = [
            "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware",
            "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
            "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
            "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
            "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
            "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
            "West Virginia", "Wisconsin", "Wyoming"
        ]

        # Widgets for Census Bureau API Key
        self.census_id_widget = widgets.Text(
            description="API Key:",
            placeholder="Enter Census Bureau API Key",
            layout=widgets.Layout(width="400px")
        )

        # Output directory
        self.output_dir_widget = widgets.Text(
            description="Output Dir:",
            placeholder="/content/output_dir",
            layout=widgets.Layout(width="400px")
        )

        # File upload widget for Gazette CSV
        self.upload_widget = widgets.FileUpload(
            accept=".csv",
            multiple=False
        )

        # Checkboxes for State Selection
        self.state_checkboxes = {state: widgets.Checkbox(value=False, description=state) for state in self.states}
        self.select_all_checkbox = widgets.Checkbox(value=False, description="Select All")
        self.select_all_checkbox.observe(self.select_all_states, names='value')
        self.state_checkboxes_container = widgets.VBox(list(self.state_checkboxes.values()))

        # Buttons
        self.generate_button = widgets.Button(
            description="Generate Demographics",
            button_style="success",
            tooltip="Generate demographics based on input",
            icon="check"
        )
        self.reset_button = widgets.Button(
            description="Reset",
            button_style="danger",
            tooltip="Reset all selections",
            icon="times"
        )
        self.generate_button.on_click(self.generate_demographics_button)
        self.reset_button.on_click(self.reset_selections)

        # Display Widgets
        self.display_widgets()

    def display_widgets(self):
        # Display all widgets in a layout
        display(widgets.HTML("<h2>State Selection Tool</h2>"))
        display(widgets.VBox([
            widgets.HTML("<b>Enter Census Bureau API Key:</b>"), self.census_id_widget,
            widgets.HTML("<b>Upload Gazette CSV File:</b>"), self.upload_widget,
            widgets.HTML("<b>Enter Output Directory:</b>"), self.output_dir_widget,
            widgets.HTML("<b>Select States:</b>"), self.select_all_checkbox, self.state_checkboxes_container,
            widgets.HBox([self.generate_button, self.reset_button])
        ]))

    def select_all_states(self, change):
        # Select or deselect all states based on the 'Select All' checkbox
        for checkbox in self.state_checkboxes.values():
            checkbox.value = change.new

    def reset_selections(self, _):
        # Reset all selections
        self.census_id_widget.value = ""
        self.output_dir_widget.value = "/content/output_dir"
        self.select_all_checkbox.value = False
        for checkbox in self.state_checkboxes.values():
            checkbox.value = False

    def generate_demographics_button(self, _):
        # Collect user inputs
        census_id = self.census_id_widget.value
        selected_states = [state for state, checkbox in self.state_checkboxes.items() if checkbox.value]
        uploaded_files = list(self.upload_widget.value.values())
        output_dir = self.output_dir_widget.value.strip()

        # Validate inputs
        if not census_id:
            print("Error: Census Bureau API Key is required.")
            return
        if not selected_states:
            print("Error: At least one state must be selected.")
            return
        if not uploaded_files:
            print("Error: Gazette CSV file is required.")
            return
        if not output_dir:
            print("Error: Output directory is required.")
            return

        # Save the uploaded Gazette file using its original name
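        # Note: this assumes the ipywidgets 7.x FileUpload API, where .value is a dict
        # keyed by file name and each entry carries a 'content' field; in ipywidgets 8.x
        # .value is a tuple of dicts instead, so these lines would need adjusting there.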
        uploaded_file_name = list(self.upload_widget.value.keys())[0]  # Get the name of the uploaded file
        gazette_file_path = os.path.join("/content", uploaded_file_name)
        with open(gazette_file_path, "wb") as f:
            f.write(uploaded_files[0]['content'])

        # Save inputs into a JSON file
        config = {
            "census_id": census_id,
            "gazette_file": gazette_file_path,
            "output_dir": output_dir,
            "selected_states": selected_states,
        }
        config_file_path = os.path.join("/content", "demographics_config.json")
        with open(config_file_path, "w") as json_file:
            json.dump(config, json_file, indent=4)

        print(f"Configuration saved to {config_file_path}.")
        print("You can now call the script to process demographics.")

StateSelectionApp()

Cell 2

import os
import re
import json
import time
import concurrent.futures
from datetime import datetime, timedelta, date

import pytz
import pandas as pd
import requests

temp_json_file = "/content/demographics_config.json"

# Load the input variables from the JSON file
with open(temp_json_file, "r") as f:
    config = json.load(f)

census_id = config.get("census_id")
gazette_file = config.get("gazette_file")
output_dir = config.get("output_dir")
selected_states = config.get("selected_states")

def get_dataframe_from_query(query):
    """
    Helper function to fetch data from the API and return a DataFrame.
    """
    response = requests.get(query)
    if response.status_code == 200:
        data = json.loads(response.text)
        return pd.DataFrame.from_dict(data).T
    else:
        print(f"Request failed with status code {response.status_code}")
        return None


def generate_demographics_for_chunk(chunk, total_places):
    """
    Worker function to process a chunk of the DataFrame.
    """
    for i, row in chunk.iterrows():
        try:
            process_place(i, row, total_places)
        except Exception as e:
            print(f"Error processing row {row['GEOID']}: {e}")

def process_place(i, row, total_places):
    """
    Function to process a single place and generate demographics.
    """
    # Extract GEOID and determine the query location
    geoid = row['GEOID']
    # Determine query parameters based on GEOID length
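    # GEOID layouts in the Gazetteer files:
    #   10 digits -> county subdivision (2 state + 3 county + 5 subdivision FIPS)
    #    7 digits -> place              (2 state + 5 place FIPS)
    #    5 digits -> county             (2 state + 3 county FIPS)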
    if len(geoid) == 10:
        StateFIPS = geoid[:2]
        CountyFIPS = geoid[2:5]
        SubdivisionFIPS = geoid[5:]
        location = f'&for=county%20subdivision:{SubdivisionFIPS}&in=state:{StateFIPS}&in=county:{CountyFIPS}'
    elif len(geoid) == 7:
        StateFIPS = geoid[:2]
        PlaceFIPS = geoid[2:]
        location = f'&for=place:{PlaceFIPS}&in=state:{StateFIPS}'
    elif len(geoid) == 5:
        StateFIPS = geoid[:2]
        CountyFIPS = geoid[2:]
        location = f'&for=county:{CountyFIPS}&in=state:{StateFIPS}'
    else:
        print(f"Invalid GEOID length for {geoid}. Skipping.")
        return

    today = date.today()
    formatted_date = today.strftime("%m-%d-%Y")
    #gazette_file = os.path.join(input_dir, "2020_Places_Combined_with_Counties.csv")
    state = get_state_name_from_fips(StateFIPS)

    # Prepare queries
    host = 'https://api.census.gov/data'
    year = '/2020'
    dataset_acronym = '/dec/pl'
    variables = 'NAME,P1_001N'
    usr_key = f"&key={census_id}"
    query = f"{host}{year}{dataset_acronym}?get={variables}{location}{usr_key}"

    host = 'https://api.census.gov/data'
    year = '/2020'
    dataset_acronym_2020census = '/dec/pl'
    dataset_acronym_2020acs5 = '/acs/acs5/subject'
    dataset_acronym_2020dp = '/dec/dp'
    dataset_acronym_2020dhc = '/dec/dhc'
    g = '?get='
    variables_2020census = 'NAME,P1_001N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,P1_008N,P1_009N,P2_002N,P2_005N' #H1_002N
    variables_2020acs5 = 'NAME,S1101_C01_002E,S1101_C01_004E,S1101_C01_003E,S1903_C03_001E,S1903_C03_001M,S1903_C03_015E,S1903_C03_015M,S2001_C03_002E,S2001_C03_002M,S2001_C05_002E,S2001_C05_002M,S2001_C01_002E,S2001_C01_002M,S1702_C02_001E,S1701_C03_001E,S1701_C03_002E,S1701_C03_010E,S1501_C01_005E,S1501_C01_015E'
    variables_2020dp = 'NAME,DP1_0002C,DP1_0003C,DP1_0004C,DP1_0005C,DP1_0006C,DP1_0007C,DP1_0008C,DP1_0009C,DP1_0010C,DP1_0011C,DP1_0012C,DP1_0013C,DP1_0014C,DP1_0015C,DP1_0016C,DP1_0017C,DP1_0018C,DP1_0019C,DP1_0021C,DP1_0073C,DP1_0025C,DP1_0049C,DP1_0069C,DP1_0045C,DP1_0133C,DP1_0142C,DP1_0138C,DP1_0139C,DP1_0143C,DP1_0141C,DP1_0147C,DP1_0132C,DP1_0145C'
    variables_2020dhc = 'NAME,P16_002N'
    usr_key = f"&key={census_id}" #Put it all together in one f-string:
    query_2020census = f"{host}{year}{dataset_acronym_2020census}{g}{variables_2020census}{location}{usr_key}"# Use requests package to call out to the API
    query_2020acs5 = f"{host}{year}{dataset_acronym_2020acs5}{g}{variables_2020acs5}{location}{usr_key}"
    query_2020dp = f"{host}{year}{dataset_acronym_2020dp}{g}{variables_2020dp}{location}{usr_key}"
    query_2020dhc = f"{host}{year}{dataset_acronym_2020dhc}{g}{variables_2020dhc}{location}{usr_key}"
    queries = [
        ("2020 Census", query_2020census),
        ("2020 ACS5", query_2020acs5),
        ("2020 DP", query_2020dp),
        ("2020 DHC", query_2020dhc),
    ]

    # Fetch all four tables through the shared helper. If any request fails,
    # skip this place entirely, since every table is needed for the text below.
    results = {}
    for label, q in queries:
        df = get_dataframe_from_query(q)
        if df is None:
            print(f"Failed to fetch {label} data for GEOID {geoid}. Skipping.")
            return
        results[label] = df

    df_2020census = results["2020 Census"]
    df_2020acs5 = results["2020 ACS5"]
    df_2020dp = results["2020 DP"]
    df_2020dhc = results["2020 DHC"]

    population= df_2020census[1][1] #P1_001N
    population = float(population)
    if population<25:
        return
    cityname = df_2020census[1][0]
    if "district" in cityname.lower() and "district of columbia" not in cityname.lower():
        return  # Skip this iteration of the loop
    city = process_place_string(cityname)

    writtendirectory = output_dir+ '/{}'.format(state)
    if not os.path.exists(writtendirectory):
        os.makedirs(writtendirectory)

    numberwhite=df_2020census[1][2] #P1_003N
    numberblack= df_2020census[1][3]#P1_004N
    numbernative= df_2020census[1][4]#P1_005N
    numberasian= df_2020census[1][5]#P1_006N
    numberpacificislander = df_2020census[1][6]#P1_007N
    numberotherrace= df_2020census[1][7]#P1_008N
    numbertwoormorerace= df_2020census[1][8] #P1_009N
    numberhispanic= df_2020census[1][9] #P2_002N
    numbernonhispanicwhite= df_2020census[1][10] #P2_005N

    popunder5 = df_2020dp[1][1]#DP1_0002C
    pop5to9 = df_2020dp[1][2]#DP1_0003C
    pop10to14 = df_2020dp[1][3]#DP1_0004C
    pop15to19 = df_2020dp[1][4]#DP1_0005C
    pop20to24 = df_2020dp[1][5]#DP1_0006C
    pop25to29 = df_2020dp[1][6]#DP1_0007C
    pop30to34 = df_2020dp[1][7]#DP1_0008C
    pop35to39 = df_2020dp[1][8]#DP1_0009C
    pop40to44 = df_2020dp[1][9]#DP1_0010C
    pop45to49 = df_2020dp[1][10]#DP1_0011C
    pop50to54 = df_2020dp[1][11]#DP1_0012C
    pop55to59 = df_2020dp[1][12]#DP1_0013C
    pop60to64 = df_2020dp[1][13]#DP1_0014C
    pop65to69 = df_2020dp[1][14]#DP1_0015C
    pop70to74 = df_2020dp[1][15]#DP1_0016C
    pop75to79 = df_2020dp[1][16]#DP1_0017C
    pop80to84 = df_2020dp[1][17]#DP1_0018C
    pop85plus = df_2020dp[1][18]#DP1_0019C
    popover18 = df_2020dp[1][19]#DP1_0021C

    popunder5 = float(popunder5)
    pop5to9 = float(pop5to9)
    pop10to14=float(pop10to14)
    pop15to19=float(pop15to19)
    pop20to24=float(pop20to24)
    pop25to29=float(pop25to29)
    pop30to34=float(pop30to34)
    pop35to39=float(pop35to39)
    pop40to44=float(pop40to44)
    pop45to49=float(pop45to49)
    pop50to54=float(pop50to54)
    pop55to59=float(pop55to59)
    pop60to64=float(pop60to64)
    pop65to69=float(pop65to69)
    pop70to74=float(pop70to74)
    pop75to79=float(pop75to79)
    pop80to84=float(pop80to84)
    pop85plus=float(pop85plus)
    popover18=float(popover18)

    popunder18 = population - popover18

    pop18to24 = pop20to24 + pop15to19 + popunder5 + pop5to9 + pop10to14 - popunder18
    pop25to44 = pop25to29 + pop30to34 + pop35to39 + pop40to44
    pop45to64 = pop45to49 + pop50to54 + pop55to59 + pop60to64
    pop65plus = pop65to69 + pop70to74 + pop75to79 + pop80to84 + pop85plus

    medianage= df_2020dp[1][20]#DP1_0073C
    malepopulation = df_2020dp[1][21]#DP1_0025C
    femalepopulation = df_2020dp[1][22]#DP1_0049C
    femalepopulation18plus = df_2020dp[1][23]#DP1_0069C
    malepopulation18plus = df_2020dp[1][24]#DP1_0045C

    medianage=float(medianage)
    malepopulation=float(malepopulation)
    if malepopulation == 0:
        return
    femalepopulation = float(femalepopulation)
    if femalepopulation == 0:
        return
    femalepopulation18plus=float(femalepopulation18plus)
    malepopulation18plus=float(malepopulation18plus)
    if malepopulation18plus == 0:
        return
    if femalepopulation18plus == 0:
        return

    femaletomaleratio= (femalepopulation/malepopulation)*100
    femaletomaleratio = round(femaletomaleratio,1)
    femaletomaleratio18plus= (femalepopulation18plus/malepopulation18plus)*100
    femaletomaleratio18plus = round(femaletomaleratio18plus,1)

    marriedcouples =df_2020dp[1][25]#DP1_0133C
    femalelivingalone = df_2020dp[1][26]#DP1_0142C
    malelivingalone = df_2020dp[1][27]#DP1_0138C
    malelivingalone65plus = df_2020dp[1][28]#DP1_0139C
    femalelivingalone65plus = df_2020dp[1][29]#DP1_0143C
    femalehouseholder = df_2020dp[1][30]#DP1_0141C

    numberofhousingunits= df_2020dp[1][31]#DP1_0147C
    totalhouseholds = df_2020dp[1][32]#DP1_0132C
    under18households= df_2020dp[1][33]#DP1_0145C
    avghouseholdsize= df_2020acs5[1][1]#S1101_C01_002E
    avgfamilysize= df_2020acs5[1][2]#S1101_C01_004E
    totalfamilies = df_2020dhc[1][1]#P16_002N
    medianhouseholdincome= df_2020acs5[1][4]#S1903_C03_001E
    medianhouseholdincomestd= df_2020acs5[1][5]#S1903_C03_001M
    medianfamilyincome= df_2020acs5[1][6]#S1903_C03_015E
    medianfamilyincomestd= df_2020acs5[1][7]#S1903_C03_015M
    medianmaleincome= df_2020acs5[1][8]#S2001_C03_002E
    medianmaleincomestd= df_2020acs5[1][9]#S2001_C03_002M
    medianfemaleincome= df_2020acs5[1][10]#S2001_C05_002E
    medianfemaleincomestd= df_2020acs5[1][11]#S2001_C05_002M
    percapitaincome= df_2020acs5[1][12]#S2001_C01_002E
    percapitaincomestd= df_2020acs5[1][13]#S2001_C01_002M
    percentpovertyfamily= df_2020acs5[1][14]#S1702_C02_001E
    percentpovertypopulation= df_2020acs5[1][15]#S1701_C03_001E
    percentpoverty18= df_2020acs5[1][16]#S1701_C03_002E
    percentpoverty65= df_2020acs5[1][17]#S1701_C03_010E

    medianhouseholdincome=int(medianhouseholdincome)
    medianhouseholdincomestd = int(medianhouseholdincomestd)
    medianfamilyincome=int(medianfamilyincome)
    medianfamilyincomestd=int(medianfamilyincomestd)
    medianmaleincome=int(medianmaleincome)
    medianmaleincomestd=int(medianmaleincomestd)
    medianfemaleincome=int(medianfemaleincome)
    medianfemaleincomestd = int(medianfemaleincomestd)
    percapitaincome=int(percapitaincome)
    percapitaincomestd=int(percapitaincomestd)

    bachelordegrees18to24 = df_2020acs5[1][18]#S1501_C01_005E
    bachelordegrees18to24=float(bachelordegrees18to24)
    bachelordegrees25plus = df_2020acs5[1][19]#S1501_C01_015E
    bachelordegrees25plus=float(bachelordegrees25plus)
    bachelordegreestotal = bachelordegrees18to24+bachelordegrees25plus

    population = int(population)

    # so = wptools.page('{}, {}'.format(city,state)).get_parse()
    # infobox = so.data['infobox']

    areami = row['ALAND_SQMI']
    areami = float(areami)
    areakm = areami*2.59

    populationdensitymi = population/areami
    populationdensitymi = round(populationdensitymi,1)
    populationdensitykm = population/areakm
    populationdensitykm = round(populationdensitykm,1)

    numberofhousingunits = int(numberofhousingunits)

    housingunitdensitymi = numberofhousingunits/areami
    housingunitdensitymi = round(housingunitdensitymi,1)
    housingunitdensitykm = numberofhousingunits/areakm
    housingunitdensitykm = round(housingunitdensitykm,1)

    numberwhite = int(numberwhite)
    numberblack = int(numberblack)
    numberasian = int(numberasian)
    numbernative = int(numbernative)
    numberpacificislander = int(numberpacificislander)
    numberotherrace = int(numberotherrace)
    numbertwoormorerace = int(numbertwoormorerace)
    numberhispanic = int(numberhispanic)
    numbernonhispanicwhite = int(numbernonhispanicwhite)

    percentwhite = 100*(numberwhite/population)
    percentwhite = round(percentwhite,2)
    percentblack = 100*(numberblack/population)
    percentblack = round(percentblack,2)
    percentasian = 100*(numberasian/population)
    percentasian = round(percentasian,2)
    percentnative = 100*(numbernative/population)
    percentnative = round(percentnative,2)
    percentpacific = 100*(numberpacificislander/population)
    percentpacific = round(percentpacific,2)
    percentotherraces = 100*(numberotherrace/population)
    percentotherraces = round(percentotherraces,2)
    percenttwoormoreraces = 100*(numbertwoormorerace/population)
    percenttwoormoreraces = round(percenttwoormoreraces,2)
    percenthispanic = 100*(numberhispanic/population)
    percenthispanic = round(percenthispanic,2)
    percentnonhispanicwhite = 100*(numbernonhispanicwhite/population)
    percentnonhispanicwhite = round(percentnonhispanicwhite,2)

    totalhouseholds = float(totalhouseholds)
    totalfamilies = float(totalfamilies)
    under18households = float(under18households)
    marriedcouples = float(marriedcouples)
    if marriedcouples <= 0:
        return

    percentmarriedcouples = 100*(marriedcouples/totalhouseholds)
    percentmarriedcouples = round(percentmarriedcouples,1)
    percentunder18households = 100*(under18households/totalhouseholds)
    percentunder18households = round(percentunder18households,1)

    malelivingalone = float(malelivingalone)
    femalelivingalone = float(femalelivingalone)
    femalehouseholder = float(femalehouseholder)
    percentfemalehouseholder = 100*(femalehouseholder/totalhouseholds)
    percentfemalehouseholder = round(percentfemalehouseholder,1)
    livingalone = malelivingalone + femalelivingalone
    percentlivingalone = 100*(livingalone/totalhouseholds)
    percentlivingalone = round(percentlivingalone,1)
    malelivingalone65plus = float(malelivingalone65plus)
    femalelivingalone65plus = float(femalelivingalone65plus)
    livingalone65plus = malelivingalone65plus + femalelivingalone65plus
    livingalone65plus = float(livingalone65plus)
    percentlivingalone65plus = 100*(livingalone65plus/totalhouseholds)
    percentlivingalone65plus = round(percentlivingalone65plus,1)

    avghouseholdsize = float(avghouseholdsize)
    avghouseholdsize = round(avghouseholdsize,1)
    avgfamilysize = float(avgfamilysize)
    avgfamilysize = round(avgfamilysize,1)

    percentpopunder18 = 100*(popunder18/population)
    percentpopunder18 = round(percentpopunder18,1)
    percentpop18to24 = 100*(pop18to24/population)
    percentpop18to24 = round(percentpop18to24,1)
    percentpop25to44 = 100*(pop25to44/population)
    percentpop25to44 = round(percentpop25to44,1)
    percentpop45to64 = 100*(pop45to64/population)
    percentpop45to64 = round(percentpop45to64,1)
    percentpop65plus = 100*(pop65plus/population)
    percentpop65plus = round(percentpop65plus,1)

    percentbachelordegrees = 100*(bachelordegreestotal/population)
    percentbachelordegrees = round(percentbachelordegrees,1)

    totalhouseholds = int(totalhouseholds)
    totalhouseholds = format(totalhouseholds, ",")

    population = format(population, ",")

    populationdensitymi = format(populationdensitymi, ",")
    populationdensitykm = format(populationdensitykm, ",")

    numberofhousingunits = format(numberofhousingunits,",")

    numberwhite = format(numberwhite,",")
    numberblack = format(numberblack,",")
    numberasian = format(numberasian,",")
    numbernative = format(numbernative,",")
    numberpacificislander = format(numberpacificislander,",")
    numberotherrace = format(numberotherrace,",")
    numbertwoormorerace = format(numbertwoormorerace,",")
    numberhispanic = format(numberhispanic,",")
    numbernonhispanicwhite = format(numbernonhispanicwhite,",")

    housingunitdensitykm = format(housingunitdensitykm,",")
    housingunitdensitymi = format(housingunitdensitymi,",")

    medianhouseholdincome = format(medianhouseholdincome,",")
    medianfemaleincome = format(medianfemaleincome,",")
    medianfemaleincomestd=format(medianfemaleincomestd,",")
    percapitaincome=format(percapitaincome,",")
    percapitaincomestd=format(percapitaincomestd,",")
    medianhouseholdincomestd=format(medianhouseholdincomestd,",")
    medianfamilyincome=format(medianfamilyincome,",")
    medianfamilyincomestd=format(medianfamilyincomestd,",")
    medianmaleincome=format(medianmaleincome,",")
    medianmaleincomestd=format(medianmaleincomestd,",")

    totalfamilies = int(totalfamilies)
    totalfamilies = format(totalfamilies, ",")
    outputtextfilename = cityname
    cityname = cityname.replace(" ","%20")

    line23 = '===2020 census==='
    line24 = '\n'
    line1 = "The [[2020 United States census|2020 United States census]] counted %s people, %s households, and %s families " % (population, totalhouseholds, totalfamilies)
    line2 = "in {}.<ref>{{{{Cite web |title=US Census Bureau, Table P16: HOUSEHOLD TYPE |url=https://data.census.gov/table?q={}%20p16&y=2020 |access-date={} |website=data.census.gov}}}}</ref><ref name="":0"" />".format(city,cityname,formatted_date)
    line22 = " The population density was %s per square mile (%s/km{{sup|2}})." % (populationdensitymi, populationdensitykm)
    line3 = " There were %s housing units at an average density of %s per square mile (%s/km{{sup|2}})." % (numberofhousingunits,housingunitdensitymi, housingunitdensitykm)
    line21 = "<ref name="":0"">{{{{Cite web |title=US Census Bureau, Table DP1: PROFILE OF GENERAL POPULATION AND HOUSING CHARACTERISTICS |url=https://data.census.gov/table/DECENNIALDP2020.DP1?q={}%20dp1 |access-date={} |website=data.census.gov}}}}</ref><ref>{{{{Cite web |last=Bureau |first=US Census |title=Gazetteer Files |url=https://www.census.gov/geographies/reference-files/2020/geo/gazetter-file.html |access-date=2023-12-30 |website=Census.gov}}}}</ref> ".format(cityname,formatted_date)
    line4 = "The racial makeup was {}% ({}) [[White (U.S. Census)|white]] or [[European American|European American]] ({}% [[Non-Hispanic White|non-Hispanic white]]), {}% ({}) [[African American (U.S. Census)|black]] or [[African American|African-American]], {}% ({}) [[Native American (U.S. Census)|Native American]] or [[Alaska Native|Alaska Native]], {}% ({}) [[Asian (U.S. Census)|Asian]], {}% ({}) [[Pacific Islander (U.S. Census)|Pacific Islander]] or [[Native Hawaiian|Native Hawaiian]], ".format(percentwhite,numberwhite,percentnonhispanicwhite,percentblack,numberblack,percentnative,numbernative,percentasian,numberasian,percentpacific,numberpacificislander)
    line5 = "{}% ({}) from [[Race (United States Census)|other races]], and {}% ({}) from [[Multiracial Americans|two or more races]].<ref>{{{{Cite web |title=US Census Bureau, Table P1: RACE |url=https://data.census.gov/table/DECENNIALPL2020.P1?q={}%20p1&y=2020 |access-date={} |website=data.census.gov}}}}</ref> [[Hispanic (U.S. Census)|Hispanic]] or [[Latino (U.S. Census)|Latino]] of any race was {}% ({}) of the population.<ref>{{{{Cite web |title=US Census Bureau, Table P2: HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE |url=https://data.census.gov/table/DECENNIALPL2020.P2?q={}%20p2&y=2020 |access-date={} |website=data.census.gov}}}}</ref>".format(percentotherraces,numberotherrace,percenttwoormoreraces,numbertwoormorerace,cityname,formatted_date,percenthispanic,numberhispanic,cityname,formatted_date)
    line6 = "\n"
    line7 = "\n"
    line8 = "Of the {} households, {}% had children under the age of 18; {}% were married couples living together; {}% had a female householder with no".format(totalhouseholds,percentunder18households,percentmarriedcouples, percentfemalehouseholder)
    line9 = " spouse or partner present. {}% of households consisted of individuals and {}% had someone ".format(percentlivingalone,percentlivingalone65plus)
    line10 = "living alone who was 65 years of age or older.<ref name="":0"" /> The average household size was {} and the average family size was {}.<ref>{{{{Cite web |title=US Census Bureau, Table S1101: HOUSEHOLDS AND FAMILIES |url=https://data.census.gov/table/ACSST5Y2020.S1101?q={}%20s1101%20&y=2020 |access-date={} |website=data.census.gov}}}}</ref> The percent of those with a bachelor’s degree or higher was estimated to be {}% of the population.<ref>{{{{Cite web |title=US Census Bureau, Table S1501: EDUCATIONAL ATTAINMENT |url=https://data.census.gov/table/ACSST5Y2020.S1501?q={}%20s1501%20&y=2020 |access-date={} |website=data.census.gov}}}}</ref>".format(avghouseholdsize,avgfamilysize,cityname,formatted_date,percentbachelordegrees,cityname,formatted_date)
    line11 = "\n"
    line12 = "\n"
    line13 = "{}% of the population was under the age of 18, {}% from 18 to 24, {}% from 25 to 44, {}% from 45 to 64, and {}% who were 65 years of age or older.".format(percentpopunder18,percentpop18to24,percentpop25to44,percentpop45to64,percentpop65plus)
    line14 = " The median age was {} years. For every 100 females, there were {} males.<ref name="":0"" /> For every 100 females ages 18 and older, there were {} males.<ref name="":0"" />".format(medianage,femaletomaleratio,femaletomaleratio18plus)
    line15 = "\n"
    line16 = "\n"

    # Define invalid values for income fields
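    # These are Census API placeholder codes meaning "no estimate available" (see the
    # Limitations section above). The comma-separated forms match the income fields,
    # which were run through format(x, ",") earlier; the plain "-666666666.0" form
    # covers the unformatted ACS percentage fields.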
    invalid_values = {"-666,666,666", "-222,222,222", "-333,333,333","-666666666.0"}

    # Determine the main combined text
    if all(
        value in invalid_values
        for value in [
            medianhouseholdincome,
            medianhouseholdincomestd,
            medianfamilyincome,
            medianfamilyincomestd,
        ]
    ):
        combined_text = (
            "The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that "
            # "no valid income data is available.<ref>{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME "
            # "IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={}%20s1903%20&y=2020 "
            # "|access-date={} |website=data.census.gov}}</ref>".format(cityname, formatted_date)
        )
    else:
        # Handle household income
        if medianhouseholdincome not in invalid_values:
            medianhouseholdincome_numeric = int(medianhouseholdincome.replace(",",""))
            if medianhouseholdincomestd not in invalid_values:
                household_income_text = (
                    "The median household income was ${} (with a margin of error of +/- ${}).".format(
                        medianhouseholdincome, medianhouseholdincomestd
                    )
                )
            elif medianhouseholdincome_numeric < 250001:
                household_income_text = "The median household income was ${}.".format(
                    medianhouseholdincome
                )
            elif medianhouseholdincome_numeric >= 250001:
                household_income_text = "The median household income was greater than $250,000."
        else:
            household_income_text = ""

        # Handle family income
        if medianfamilyincome not in invalid_values:
            medianfamilyincome_numeric = int(medianfamilyincome.replace(",", ""))
            if medianfamilyincomestd not in invalid_values:
                family_income_text = (
                    " The median family income was ${} (+/- ${}).".format(
                        medianfamilyincome, medianfamilyincomestd
                    )
                )
            elif medianfamilyincome_numeric < 250001:
                family_income_text = " The median family income was ${}.".format(
                    medianfamilyincome
                )
            elif medianfamilyincome_numeric >= 250001:
                family_income_text = " The median family income was greater than $250,000."

        else:
            family_income_text = ""

        if family_income_text != "" and household_income_text != "":
            combined_text = (
                f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {household_income_text}{family_income_text}"
                f"<ref>{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
                f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
                f"|access-date={formatted_date} |website=data.census.gov}}</ref>"
                )
        elif family_income_text != "" and household_income_text == "":
            combined_text = (
                f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {family_income_text}"
                f"<ref>{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
                f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
                f"|access-date={formatted_date} |website=data.census.gov}}</ref>"
                )
        elif family_income_text == "" and household_income_text != "":
            combined_text = (
                f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {household_income_text}"
                f"<ref>{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
                f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
                f"|access-date={formatted_date} |website=data.census.gov}}</ref>"
                )
        elif family_income_text == "" and household_income_text == "":
            combined_text = (
                f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that "
                )

    # Gender income text
    if medianmaleincome in invalid_values and medianfemaleincome in invalid_values:
        gender_income_text = ""
    elif medianmaleincome not in invalid_values and medianfemaleincome not in invalid_values:
        if medianmaleincomestd in invalid_values and medianfemaleincomestd in invalid_values:
            gender_income_text = " Males had a median income of ${} versus ${} for females.".format(
                medianmaleincome, medianfemaleincome
            )
        elif medianmaleincomestd in invalid_values:
            gender_income_text = " Males had a median income of ${} versus ${} (+/- ${}) for females.".format(
                medianmaleincome, medianfemaleincome, medianfemaleincomestd
            )
        elif medianfemaleincomestd in invalid_values:
            gender_income_text = " Males had a median income of ${} (+/- ${}) versus ${} for females.".format(
                medianmaleincome, medianmaleincomestd, medianfemaleincome
            )
        else:
            gender_income_text = " Males had a median income of ${} (+/- ${}) versus ${} (+/- ${}) for females.".format(
                medianmaleincome, medianmaleincomestd, medianfemaleincome, medianfemaleincomestd
            )
    elif medianmaleincome in invalid_values:
        if medianfemaleincomestd in invalid_values:
            gender_income_text = " Females had a median income of ${}.".format(medianfemaleincome)
        else:
            gender_income_text = " Females had a median income of ${} (+/- ${}).".format(
                medianfemaleincome, medianfemaleincomestd
            )
    elif medianfemaleincome in invalid_values:
        if medianmaleincomestd in invalid_values:
            gender_income_text = " Males had a median income of ${}.".format(medianmaleincome)
        else:
            gender_income_text = " Males had a median income of ${} (+/- ${}).".format(
                medianmaleincome, medianmaleincomestd
            )


    # Per capita income text
    per_capita_income_text = (
        ""
        if percapitaincome in invalid_values
        else " The median income for those above 16 years old was ${} (+/- ${}).<ref>{{{{Cite web |title=US Census Bureau, Table S2001: "
        "EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)|url=https://data.census.gov/table/ACSST5Y2020.S2001?q={}%20s2001%20&y=2020 "
        "|access-date={} |website=data.census.gov}}}}</ref>".format(
            percapitaincome, percapitaincomestd, cityname, formatted_date
        )
        if percapitaincomestd not in invalid_values
        else " The median income for those above 16 years old was ${}.<ref>{{{{Cite web |title=US Census Bureau, Table S2001: "
        "EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)|url=https://data.census.gov/table/ACSST5Y2020.S2001?q={}%20s2001%20&y=2020 "
        "|access-date={} |website=data.census.gov}}}}</ref>".format(
            percapitaincome, cityname, formatted_date
        )
    )


    # Poverty text
    if all(
        value in invalid_values
        for value in [
            percentpovertyfamily,
            percentpovertypopulation,
            percentpoverty18,
            percentpoverty65,
        ]
    ):
        poverty_text = ""  # Exclude poverty text entirely if all values are invalid
    else:
        # Handle individual cases and permutations
        family_text = (
            f"{percentpovertyfamily}% of families"
            if percentpovertyfamily not in invalid_values
            else ""
        )
        population_text = (
            f"{percentpovertypopulation}% of the population"
            if percentpovertypopulation not in invalid_values
            else ""
        )
        under_18_text = (
            f"{percentpoverty18}% of those under the age of 18"
            if percentpoverty18 not in invalid_values
            else ""
        )
        over_65_text = (
            f"{percentpoverty65}% of those ages 65 or over"
            if percentpoverty65 not in invalid_values
            else ""
        )

        # Combine valid components dynamically
        main_components = [text for text in [family_text, population_text] if text]
        main_text = " and ".join(main_components)
        additional_components = [text for text in [under_18_text, over_65_text] if text]
        additional_text = " and ".join(additional_components)

        # Construct poverty_text dynamically
        if main_text and additional_text:
            poverty_text = (
                f" Approximately, {main_text} were below the [[poverty line]], including {additional_text}."
            )
        elif main_text:
            poverty_text = f" Approximately, {main_text} were below the [[poverty line]]."
        else:
            poverty_text = ""

        # Append references for poverty data if there's any text
        if poverty_text:
            poverty_text += (
                f"<ref>{{{{Cite web |title=US Census Bureau, Table S1701: POVERTY STATUS IN THE PAST 12 MONTHS |url=https://data.census.gov/table/ACSST5Y2020.S1701?q={cityname}%20s1701%20&y=2020 "
                f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
                f"<ref>{{{{Cite web |title=US Census Bureau, Table S1702: POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES |url=https://data.census.gov/table/ACSST5Y2020.S1702?q={cityname}%20s1702&y=2020 "
                f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
            )

    # Combine all texts
    final_text = f"{combined_text}{gender_income_text}{per_capita_income_text}{poverty_text}"

    # Apply corrections
    final_text = fix_space_after_ref(final_text)

    if "]]Estimates" in final_text:
        final_text = final_text.replace("]]Estimates", "]] estimates")

    # Fix "The" capitalization after "show that"
    if "show that The" in final_text:
        final_text = final_text.replace("show that The", "show that the")
    else:
        # Alternative checks if exact match fails
        snippet_start = final_text.find("show that")
        if snippet_start != -1:
            snippet = final_text[snippet_start:snippet_start + 20]  # Extract a snippet around "show that"
            print("DEBUG: Snippet around 'show that':", snippet)

        # Check with potential variations
        if "show that  The" in final_text:
            final_text = final_text.replace("show that  The", "show that the")
        elif "show that the" in final_text.lower():
            # Normalize capitalization where "show that the" exists in any casing
            final_text = final_text[:snippet_start] + final_text[snippet_start:].replace("The", "the", 1)

    # Fix double spaces
    while "  " in final_text:
        final_text = final_text.replace("  ", " ")

    # Fix percentages with extra spaces (e.g., "14. 7%" -> "14.7%")
    final_text = re.sub(r"(\d)\.\s+(\d)", r"\1.\2", final_text)

    # Trim spaces at the start and end of the text
    final_text = final_text.strip()

    # Fix spaces after </ref>
    final_text = fix_space_after_ref(final_text)

    line120 = line23+line24+line1+line2+line22+line3+line21+line4+line5+line6+line7+line8+line9+line10+line11+line12+line13+line14+line15+line16+final_text
    with open(writtendirectory + '/%s_Demographics.txt' % (outputtextfilename), 'w+') as text_file:
        print(f"{line120}", file=text_file)

    print(f"Processing: {outputtextfilename}")

    # Print remaining places
    #print(total_places - i - 1, "places left")

def generate_demographics(census_id, gazette_file, output_dir, selected_states):
    """
    Main function to generate demographics in parallel.
    """
    today = date.today()
    formatted_date = today.strftime("%m-%d-%Y")

    # Load gazette data and filter by selected states
    gazette_data = pd.read_csv(gazette_file, dtype=str)
    selected_states = get_abbreviations_from_selected_states(selected_states)
    filtered_df = gazette_data[gazette_data['USPS'].isin(selected_states)].copy()

    # Define Central Time timezone
    central_time = pytz.timezone('America/Chicago')

    total_places = len(filtered_df)
    # Calculate time range (50% and 70% of list length)
    lower_bound = total_places * 0.5
    upper_bound = total_places * 0.7

    # Get current time in Central Time
    current_time_utc = datetime.now(pytz.utc)  # Get current time in UTC
    current_time_ct = current_time_utc.astimezone(central_time)  # Convert to Central Time

    # Calculate expected completion times in Central Time
    completion_time_lower_ct = current_time_ct + timedelta(seconds=lower_bound)
    completion_time_upper_ct = current_time_ct + timedelta(seconds=upper_bound)

    print(f"Expected completion time range in Mountain Time: {completion_time_lower_ct} - {completion_time_upper_ct}")
    num_chunks = min(10, len(filtered_df))  # Use at most 10 chunks or fewer if the DataFrame is small
    chunk_size = max(1, len(filtered_df) // num_chunks)  # Ensure chunk size is at least 1

    chunks = [filtered_df.iloc[i:i + chunk_size] for i in range(0, len(filtered_df), chunk_size)]

    # Process chunks in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(lambda chunk: generate_demographics_for_chunk(chunk, total_places), chunks)

def get_state_name_from_fips(state_fips):
    # Dictionary mapping StateFIPS codes to state names
    fips_to_state_name = {
        "01": "Alabama",
        "02": "Alaska",
        "04": "Arizona",
        "05": "Arkansas",
        "06": "California",
        "08": "Colorado",
        "09": "Connecticut",
        "10": "Delaware",
        "11": "District of Columbia",
        "12": "Florida",
        "13": "Georgia",
        "15": "Hawaii",
        "16": "Idaho",
        "17": "Illinois",
        "18": "Indiana",
        "19": "Iowa",
        "20": "Kansas",
        "21": "Kentucky",
        "22": "Louisiana",
        "23": "Maine",
        "24": "Maryland",
        "25": "Massachusetts",
        "26": "Michigan",
        "27": "Minnesota",
        "28": "Mississippi",
        "29": "Missouri",
        "30": "Montana",
        "31": "Nebraska",
        "32": "Nevada",
        "33": "New Hampshire",
        "34": "New Jersey",
        "35": "New Mexico",
        "36": "New York",
        "37": "North Carolina",
        "38": "North Dakota",
        "39": "Ohio",
        "40": "Oklahoma",
        "41": "Oregon",
        "42": "Pennsylvania",
        "44": "Rhode Island",
        "45": "South Carolina",
        "46": "South Dakota",
        "47": "Tennessee",
        "48": "Texas",
        "49": "Utah",
        "50": "Vermont",
        "51": "Virginia",
        "53": "Washington",
        "54": "West Virginia",
        "55": "Wisconsin",
        "56": "Wyoming",
    }

    # Return the state name from StateFIPS
    return fips_to_state_name.get(state_fips, "State FIPS code not found")


def get_abbreviations_from_selected_states(selected_states):
    # Dictionary mapping state names to abbreviations
    state_to_abbreviation = {
        "Alabama": "AL",
        "Alaska": "AK",
        "Arizona": "AZ",
        "Arkansas": "AR",
        "California": "CA",
        "Colorado": "CO",
        "Connecticut": "CT",
        "Delaware": "DE",
        "Florida": "FL",
        "Georgia": "GA",
        "Hawaii": "HI",
        "Idaho": "ID",
        "Illinois": "IL",
        "Indiana": "IN",
        "Iowa": "IA",
        "Kansas": "KS",
        "Kentucky": "KY",
        "Louisiana": "LA",
        "Maine": "ME",
        "Maryland": "MD",
        "Massachusetts": "MA",
        "Michigan": "MI",
        "Minnesota": "MN",
        "Mississippi": "MS",
        "Missouri": "MO",
        "Montana": "MT",
        "Nebraska": "NE",
        "Nevada": "NV",
        "New Hampshire": "NH",
        "New Jersey": "NJ",
        "New Mexico": "NM",
        "New York": "NY",
        "North Carolina": "NC",
        "North Dakota": "ND",
        "Ohio": "OH",
        "Oklahoma": "OK",
        "Oregon": "OR",
        "Pennsylvania": "PA",
        "Rhode Island": "RI",
        "South Carolina": "SC",
        "South Dakota": "SD",
        "Tennessee": "TN",
        "Texas": "TX",
        "Utah": "UT",
        "Vermont": "VT",
        "Virginia": "VA",
        "Washington": "WA",
        "West Virginia": "WV",
        "Wisconsin": "WI",
        "Wyoming": "WY",
    }

    # Create a list of abbreviations for the selected states
    abbreviations = [state_to_abbreviation.get(state, "State not found") for state in selected_states]

    return abbreviations

def process_place_string(place_string):
    # Skip specific substrings
    if "County subdivisions not defined" in place_string or "Municipio subdivision not defined" in place_string or "County subdivisions not defined" in place_string:
        return None

    # Split the string into words
    words = place_string.split()

    # Identify the cutoff point
    cutoff_index = 0
    for i, word in enumerate(words):
        # Check if the word is all caps (like an abbreviation)
        if word.isupper():
            break
        # Check if the word starts with a capital letter
        elif word[0].isupper():
            cutoff_index = i + 1
        else:
            break

    # Return the string up to the cutoff point
    return " ".join(words[:cutoff_index])

def fix_space_after_ref(text):
    """
    Adds a space after </ref> if the next character is not '<' or a space.
    """
    corrected_text = ""
    i = 0

    while i < len(text):
        if text[i:i+6] == "</ref>" and i+6 < len(text):
            next_char = text[i+6]
            if next_char != '<' and next_char != ' ':
                corrected_text += "</ref> "  # Add </ref> followed by a space
            else:
                corrected_text += "</ref>"  # Keep </ref> as-is
            i += 6  # Skip over "</ref>"
        else:
            corrected_text += text[i]  # Add the current character
            i += 1  # Move to the next character

    return corrected_text

def correct_random_capitalization_and_fix_spaces(text):
    """
    Corrects random capitalization and fixes spacing issues while preserving text within [[ ]] and <ref> </ref>.
    Ensures proper formatting for inline phrases and removes extra spaces.

    Args:
        text (str): The input text with potentially incorrect capitalization and spacing.

    Returns:
        str: Corrected text.
    """
    # Define patterns for preserving [[ ]] and <ref> tags
    patterns_to_ignore = r'(\[\[.*?\]\])|(<ref>.*?</ref>)'

    # Split text into parts to process or preserve
    parts = re.split(patterns_to_ignore, text)
    corrected_text = []

    for part in parts:
        if part is None:
            continue

        # Preserve parts within [[ ]] and <ref> as-is
        if re.match(patterns_to_ignore, part):
            corrected_text.append(part)
        else:
            # Remove double spaces and fix capitalization
            cleaned_part = re.sub(r'\s{2,}', ' ', part.strip())
            sentences = re.findall(r'[^.!?]*[.!?]?\s*', cleaned_part)
            corrected_sentences = []

            for i, sentence in enumerate(sentences):
                stripped_sentence = sentence.strip()
                if not stripped_sentence:
                    # Preserve empty spaces or breaks
                    corrected_sentences.append(sentence)
                    continue

                # Check for inline continuation (e.g., "show that the")
                if i > 0 and corrected_sentences[-1].strip().endswith(("that", "of", "for", "and", "or")):
                    corrected = stripped_sentence[0].lower() + stripped_sentence[1:]
                else:
                    # Standard capitalization for new sentences
                    corrected = stripped_sentence[0].upper() + stripped_sentence[1:]

                # Ensure specific terms retain proper casing (e.g., 'estimates')
                corrected = corrected.replace("Estimates", "estimates")

                corrected_sentences.append(corrected)

            corrected_text.append(' '.join(corrected_sentences))

    # Reassemble corrected parts and fix lingering double spaces
    final_text = ''.join(corrected_text).strip()
    final_text = re.sub(r'\s{2,}', ' ', final_text)

    # Fix spacing around punctuation (e.g., "14. 7%")
    final_text = re.sub(r'(\d)\.\s+(\d)', r'\1.\2', final_text)

    return final_text
tic = time.time()
generate_demographics(census_id, gazette_file, output_dir, selected_states)
toc = time.time()
print(toc - tic, 'seconds elapsed')

Cell 3

import shutil
import json
import os
from google.colab import files

# Load the input variables from the JSON file
temp_json_file = "/content/demographics_config.json"  # Path to your JSON file
with open(temp_json_file, "r") as f:
    config = json.load(f)

# Extract selected states from the JSON
selected_states = config.get("selected_states", [])
output_dir = config.get("output_dir", "/content/output")

# Loop through each selected state and create a ZIP file
for state in selected_states:
    folder_to_download = os.path.join(output_dir, state)
    
    # Check if the folder exists before zipping
    if os.path.exists(folder_to_download):
        output_zip_file = f"{folder_to_download}.zip"
        
        # Compress the folder into a ZIP file
        shutil.make_archive(output_zip_file.replace(".zip", ""), 'zip', folder_to_download)
        
        # Download the ZIP file
        files.download(output_zip_file)
    else:
        print(f"Folder for state '{state}' not found. Skipping.")
  1. ^ McManus, Michael (January 22, 2022). "Using the U.S. Census Bureau API with Python".