User:TheWeeklyIslander
Hello,
I'm TheWeeklyIslander. My name was inspired by The Colorado Kid by Stephen King, where there is a fictitious newspaper called The Weekly Islander. Contrary to the name, I have been on very few islands in my life, let alone weekly.
If you are on my page, it is likely due to an update I made to the demographic section of an American place (city, town, CDP, village, etc.). You can update too! Just follow the section below!
Open-Source Demographics Generating Tool
Hello,
We are coming up on four years without accurate demographics data for much of the United States, a country that has been rapidly diversifying. Demographics have a massive impact on the United States. The only thing worse than having no demographics data is having inaccurate demographics data, a plight I have seen in many places that do have demographics sections. Those sections are also often improperly cited, which I have sought to fix as well. This tool gathers data from the Census Bureau and exports it to text files that can be copied and pasted into Wikipedia through source editing.
Updating the demographics data for the entire United States is too much for one person, so I built a guided user interface for anyone to use. I made it compatible with Google Colab for accessibility and ease of use. Google Colab is free, the only requirement is a Google account, and you may access it at https://colab.research.google.com.
To run the scripts, you will need a few inputs:
- A Census Bureau API Key. This is free and you can register for one on https://api.census.gov/data/key_signup.html. You do not need to be part of an organization.
- The Gazetteer files for 2020. Those can be found here: https://www.census.gov/geographies/reference-files/2020/geo/gazetter-file.html.
- I only built my tool to handle County, County Subdivision, and Place data, so you may have to alter the scripts to handle other areas such as tract, congressional district, etc.
- Convert the Gazetteer files to .csv format using Excel, because that is what I wrote my scripts to handle. Do not let Excel convert the values (for example, stripping the leading zeros from GEOIDs) when doing this!
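If you would rather skip Excel entirely, the tab-delimited Gazetteer file can be converted with pandas while keeping every column as text, which sidesteps the value-conversion problem. This is a sketch of my own, not part of the tool; the function name is mine, and the `latin-1` encoding is an assumption about how the Census publishes these files (adjust if your download differs):

```python
import pandas as pd

def gazetteer_to_csv(src: str, dst: str) -> None:
    """Convert a tab-delimited Gazetteer file to .csv without mangling values.

    dtype=str keeps GEOIDs (which carry leading zeros) as text instead of
    letting pandas coerce them to integers.
    """
    # encoding is an assumption; change it if your Gazetteer download differs
    df = pd.read_csv(src, sep="\t", dtype=str, encoding="latin-1")
    df.columns = df.columns.str.strip()  # Gazetteer headers carry trailing whitespace
    df.to_csv(dst, index=False)
```

The scripts key off GEOID length (10, 7, or 5 characters), so preserving those strings exactly is the whole point of this step.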
Instructions:
1. Acquire the inputs listed above.
2. Create an account on Google Colab or use your personal Jupyter Notebook (I haven't tried the latter, but if you have a JN then you can likely figure it out).
3. Copy and paste Cell 1 into the first cell.
4. Copy and paste Cell 2 into the second cell.
5. Copy and paste Cell 3 into the third cell.
6. Run Cell 1. At the bottom, just past the cell, a scrollable section will appear with a text box, an upload button, another text box, and a series of checkboxes labelled by state.
   - The first text box is for the API key you applied for.
   - The upload button is for the Gazetteer .csv file that you made.
   - The second text box is for the output directory. I recommend typing "/content".
   - The checkboxes are for the states you would like to generate demographic information for. There is a convenient "Select All" button if you are ambitious; otherwise just choose the state(s) you like.
7. Once all of the inputs in step 6 have been entered, hit the green "Generate Demographics" button at the end of the list of state checkboxes. This will create a JSON file in the "/content" directory.
8. Run Cell 2. If you scroll to the bottom of Cell 2, you will see it print an estimated time range for completion and the name of the place currently being generated. It is parallelized as best I could manage, and I found it would generate all areas in ~13 hours, or about one place every half second.
9. Run Cell 3. This will create .zip folders for each of the states you selected in step 6 and download them to your computer. Note: you do not need to wait for step 8 to be completed before hitting the run button on Cell 3, so you may leave it running while you do other things.
10. Edit Wikipedia to your heart's content.
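Before launching a full multi-hour run, it can be worth sanity-checking your API key and query format on a single small request. This helper is my own sketch mirroring how Cell 2 assembles its URLs; the function name and the Rhode Island FIPS example are mine, not part of the tool. Paste the resulting URL into a browser to confirm your key is active:

```python
def build_census_query(dataset: str, variables: str, location: str, api_key: str) -> str:
    """Assemble a 2020 Census API URL the same way Cell 2 does."""
    host = "https://api.census.gov/data"
    year = "/2020"
    return f"{host}{year}{dataset}?get={variables}{location}&key={api_key}"

# Total population of one state from the 2020 redistricting (PL) file;
# 44 is the Rhode Island state FIPS code.
url = build_census_query("/dec/pl", "NAME,P1_001N", "&for=state:44", "YOUR_KEY_HERE")
print(url)
```

A valid key returns a small JSON array; an invalid one returns an error page, which is much cheaper to discover now than 13 hours in.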
Limitations/Notes:
This is the hobby project of a tired grad student, not a programmer, so please be understanding if you don't find this tool perfect. I have noticed a couple of limitations and will list them here:
- I have commented extensively. The most substantial comments document which Census Bureau tables and variables I used, in the pursuit of full transparency: if there are any issues, I hope the community can find and fix them. I know some people have done similar Python exercises, but I haven't found their scripts posted. I manually checked a few places throughout the U.S. against the Census Bureau tables I used, but with nearly 66,000 areas across places, counties, and county subdivisions, there is just no way I can check everywhere.
- The biggest issue I expect to hear about (and hope I won't, now that I am addressing it) is that it looks like some places are not being generated. I set a limit on the population size of places the script generates information for: places below 25 people are skipped. To include them, lower the threshold in Cell 2, e.g. change
if population < 25:
    return
to
if population < 1:
    return
or remove the check altogether. (Flipping the comparison to population > 25 would instead generate only the previously skipped small places.) I also removed the ability to generate county subdivision places that have "District" or "Precinct" in their name. These choices were made to minimize the likelihood of throwing errors, and they appear to have worked.
- This script only works for data from Places, Counties, and County Subdivisions in the 50 states of the U.S.; it does not currently work for tracts, congressional districts, etc.
- Because it only works on those areas, it does not handle Washington, D.C. or any territories, like Puerto Rico. The Census Bureau has this data somewhere, and you are welcome to modify these scripts to handle those data.
- If your version of Excel is too old (like 2010), it will not properly encode the data when converting the Gazetteer file to csv.
- You will find that certain cities are generated twice if you generate both places and county subdivisions. One text file is named with the city's name alone, the other with the city's name and county. These are the same data as far as I have checked, but you may verify for yourself. I kept the duplicates deliberately: in New England and the mid-Atlantic, many townships, towns, etc. were being skipped over because of the way those places are recorded, such as South Kingstown, Rhode Island. When I realized these places were in the county subdivisions, I decided it would be better to have duplicates than exclusions, especially with the parallelization. This was one of the major reasons I did not release this sooner.
- There were times when massively negative numbers came back from the census tables for small-population areas with no data: values like -$333,333,333, or all 2's or all 6's. I believe I have fixed this, but don't be afraid to mention it if you see it. Please note that this only affects the ACS data, not the decennial census data, so you can simply cut the affected text out.
- Do not trust that just because a place has a demographics section that it is accurate. You would be surprised how often this is not the case.
- This script does not generate the tables that people really like to put up instead of a demographics section, but all of the data necessary to build one is included in these scripts.
- Inclusion of ACS data: I don't personally view this as a limitation, but I have noticed some people disagree about ACS data being included in my demographics sections. I am aware that the ACS is not the decennial census, as became very clear from poring over all of these data tables while compiling them here. The ACS data is used primarily in the last paragraph of the demographics sections, which details income and poverty data, as well as for the estimate of bachelor's degree holders in the second-to-last paragraph. If you do not like this data or believe it should not be included, either change the section to say "2020" or remove the text from the Python script. However, I believe this is important data that should be included. It was collected in the 2000 decennial census and shifted to the ACS when that survey was created in 2005, both to simplify the decennial census and, more importantly, to provide frequent, up-to-date economic information. I used the 5-year data because it is aggregated over a five-year period and is therefore more reliable for small areas.
- This is the link that inspired me and gave the basic background to use an API with Python to build this tool.[1] Thanks to Michael McManus for his guide!
- If there are any errors, I will very intermittently work to fix them or discuss with members of the community. Apologies, but I have commitments that take precedence over editing Wikipedia.
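For the duplicate-output limitation above, one way to review the place/county-subdivision overlap after a run is to group the generated text files by content hash, so identical pairs can be spot-checked before uploading. This is my own sketch, not part of the tool; the function name is mine:

```python
import hashlib
import os

def find_duplicate_outputs(directory: str) -> dict:
    """Group generated .txt files by content hash so duplicates can be reviewed."""
    by_hash = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_hash.setdefault(digest, []).append(name)
    # Keep only hashes shared by two or more files
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

Files that hash identically are byte-for-byte the same, so you only need to read one of each pair; any near-duplicates with differing content will not be grouped and deserve a closer look.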
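For the negative-sentinel limitation above, a defensive way to handle those values is to treat the known ACS "no data" codes as missing before any text is generated. The sentinel set below reflects the codes mentioned in this page plus my reading of ACS conventions, and the helper name is mine, not the tool's:

```python
# ACS codes that mean "no data available"; treat them as missing rather
# than as real dollar figures. This set is an assumption based on the
# values described above.
ACS_SENTINELS = {-666666666, -333333333, -222222222}

def clean_acs_value(raw):
    """Return a float, or None when the ACS reports a sentinel or empty value."""
    if raw is None or raw == "":
        return None
    value = float(raw)
    return None if value in ACS_SENTINELS else value
```

Anywhere a `None` comes back, the corresponding sentence (income, poverty, degree share) can simply be omitted instead of printing a nonsense number.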
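And for the missing-tables note above: since all of the underlying numbers are already generated, a small formatter can turn them into wiki table markup. This is a sketch of my own; the function name and column headers are my choices, not the tool's output format:

```python
def race_wikitable(rows):
    """Render (label, count, percent) tuples as minimal wiki table markup."""
    lines = ['{| class="wikitable"', "|-", "! Race !! Number !! Percentage"]
    for label, count, pct in rows:
        lines.append("|-")
        lines.append(f"| {label} || {count:,} || {pct}%")  # thousands separator on counts
    lines.append("|}")
    return "\n".join(lines)
```

Feeding it the race counts and percentages the script computes would produce a table ready to paste into source editing.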
Cell 1
[edit]"""
@author: TheWeeklyIslander
"""
import os
import json
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from google.colab import files
class StateSelectionApp:
def __init__(self):
self.states = [
"Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware",
"Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
"Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
"North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
"West Virginia", "Wisconsin", "Wyoming"
]
# Widgets for Census Bureau API Key
self.census_id_widget = widgets.Text(
description="API Key:",
placeholder="Enter Census Bureau API Key",
layout=widgets.Layout(width="400px")
)
# Output directory
self.output_dir_widget = widgets.Text(
description="Output Dir:",
placeholder="/content/output_dir",
layout=widgets.Layout(width="400px")
)
# File upload widget for Gazette CSV
self.upload_widget = widgets.FileUpload(
accept=".csv",
multiple=False
)
# Checkboxes for State Selection
self.state_checkboxes = {state: widgets.Checkbox(value=False, description=state) for state in self.states}
self.select_all_checkbox = widgets.Checkbox(value=False, description="Select All")
self.select_all_checkbox.observe(self.select_all_states, names='value')
self.state_checkboxes_container = widgets.VBox(list(self.state_checkboxes.values()))
# Buttons
self.generate_button = widgets.Button(
description="Generate Demographics",
button_style="success",
tooltip="Generate demographics based on input",
icon="check"
)
self.reset_button = widgets.Button(
description="Reset",
button_style="danger",
tooltip="Reset all selections",
icon="times"
)
self.generate_button.on_click(self.generate_demographics_button)
self.reset_button.on_click(self.reset_selections)
# Display Widgets
self.display_widgets()
def display_widgets(self):
# Display all widgets in a layout
display(widgets.HTML("<h2>State Selection Tool</h2>"))
display(widgets.VBox([
widgets.HTML("<b>Enter Census Bureau API Key:</b>"), self.census_id_widget,
widgets.HTML("<b>Upload Gazette CSV File:</b>"), self.upload_widget,
widgets.HTML("<b>Enter Output Directory:</b>"), self.output_dir_widget,
widgets.HTML("<b>Select States:</b>"), self.select_all_checkbox, self.state_checkboxes_container,
widgets.HBox([self.generate_button, self.reset_button])
]))
def select_all_states(self, change):
# Select or deselect all states based on the 'Select All' checkbox
for checkbox in self.state_checkboxes.values():
checkbox.value = change.new
def reset_selections(self, _):
# Reset all selections
self.census_id_widget.value = ""
self.output_dir_widget.value = "/content/output_dir"
self.select_all_checkbox.value = False
for checkbox in self.state_checkboxes.values():
checkbox.value = False
def generate_demographics_button(self, _):
# Collect user inputs
census_id = self.census_id_widget.value
selected_states = [state for state, checkbox in self.state_checkboxes.items() if checkbox.value]
uploaded_files = list(self.upload_widget.value.values())
output_dir = self.output_dir_widget.value.strip()
# Validate inputs
if not census_id:
print("Error: Census Bureau API Key is required.")
return
if not selected_states:
print("Error: At least one state must be selected.")
return
if not uploaded_files:
print("Error: Gazette CSV file is required.")
return
if not output_dir:
print("Error: Output directory is required.")
return
# Save the uploaded Gazette file
# Save the uploaded Gazette file using its original name
uploaded_file_name = list(self.upload_widget.value.keys())[0] # Get the name of the uploaded file
gazette_file_path = os.path.join("/content", uploaded_file_name)
with open(gazette_file_path, "wb") as f:
f.write(uploaded_files[0]['content'])
# Save inputs into a JSON file
config = {
"census_id": census_id,
"gazette_file": gazette_file_path,
"output_dir": output_dir,
"selected_states": selected_states,
}
config_file_path = os.path.join("/content", "demographics_config.json")
with open(config_file_path, "w") as json_file:
json.dump(config, json_file, indent=4)
print(f"Configuration saved to {config_file_path}.")
print("You can now call the script to process demographics.")
StateSelectionApp()
Cell 2
import subprocess
import sys
import os
import re
import json
import time
import concurrent.futures
from datetime import datetime, timedelta, date
import pytz
import pandas as pd
import requests
import numpy as np

temp_json_file = "/content/demographics_config.json"
# Load the input variables from the JSON file
with open(temp_json_file, "r") as f:
    config = json.load(f)
census_id = config.get("census_id")
gazette_file = config.get("gazette_file")
output_dir = config.get("output_dir")
selected_states = config.get("selected_states")

def get_dataframe_from_query(query):
    """
    Helper function to fetch data from the API and return a DataFrame.
    """
    response = requests.get(query)
    if response.status_code == 200:
        data = json.loads(response.text)
        return pd.DataFrame.from_dict(data).T
    else:
        print(f"Request failed with status code {response.status_code}")
        return None

def generate_demographics_for_chunk(chunk, total_places):
    """
    Worker function to process a chunk of the DataFrame.
    """
    for i, row in chunk.iterrows():
        try:
            process_place(i, row, total_places)
        except Exception as e:
            print(f"Error processing row {row['GEOID']}: {e}")
def process_place(i, row, total_places):
    """
    Function to process a single place and generate demographics.
    """
    # Extract GEOID and determine the query location
    geoid = row['GEOID']
    # Determine query parameters based on GEOID length
    if len(geoid) == 10:
        StateFIPS = geoid[:2]
        CountyFIPS = geoid[2:5]
        SubdivisionFIPS = geoid[5:]
        location = f'&for=county%20subdivision:{SubdivisionFIPS}&in=state:{StateFIPS}&in=county:{CountyFIPS}'
    elif len(geoid) == 7:
        StateFIPS = geoid[:2]
        PlaceFIPS = geoid[2:]
        location = f'&for=place:{PlaceFIPS}&in=state:{StateFIPS}'
    elif len(geoid) == 5:
        StateFIPS = geoid[:2]
        CountyFIPS = geoid[2:]
        location = f'&for=county:{CountyFIPS}&in=state:{StateFIPS}'
    else:
        print(f"Invalid GEOID length for {geoid}. Skipping.")
        return
    today = date.today()
    formatted_date = today.strftime("%m-%d-%Y")
    # gazette_file = os.path.join(input_dir, "2020_Places_Combined_with_Counties.csv")
    state = get_state_name_from_fips(StateFIPS)
    # Prepare queries
    host = 'https://api.census.gov/data'
    year = '/2020'
    dataset_acronym = '/dec/pl'
    variables = 'NAME,P1_001N'
    usr_key = f"&key={census_id}"
    query = f"{host}{year}{dataset_acronym}?get={variables}{location}{usr_key}"
    dataset_acronym_2020census = '/dec/pl'
    dataset_acronym_2020acs5 = '/acs/acs5/subject'
    dataset_acronym_2020dp = '/dec/dp'
    dataset_acronym_2020dhc = '/dec/dhc'
    g = '?get='
    variables_2020census = 'NAME,P1_001N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,P1_008N,P1_009N,P2_002N,P2_005N'  # H1_002N
    variables_2020acs5 = 'NAME,S1101_C01_002E,S1101_C01_004E,S1101_C01_003E,S1903_C03_001E,S1903_C03_001M,S1903_C03_015E,S1903_C03_015M,S2001_C03_002E,S2001_C03_002M,S2001_C05_002E,S2001_C05_002M,S2001_C01_002E,S2001_C01_002M,S1702_C02_001E,S1701_C03_001E,S1701_C03_002E,S1701_C03_010E,S1501_C01_005E,S1501_C01_015E'
    variables_2020dp = 'NAME,DP1_0002C,DP1_0003C,DP1_0004C,DP1_0005C,DP1_0006C,DP1_0007C,DP1_0008C,DP1_0009C,DP1_0010C,DP1_0011C,DP1_0012C,DP1_0013C,DP1_0014C,DP1_0015C,DP1_0016C,DP1_0017C,DP1_0018C,DP1_0019C,DP1_0021C,DP1_0073C,DP1_0025C,DP1_0049C,DP1_0069C,DP1_0045C,DP1_0133C,DP1_0142C,DP1_0138C,DP1_0139C,DP1_0143C,DP1_0141C,DP1_0147C,DP1_0132C,DP1_0145C'
    variables_2020dhc = 'NAME,P16_002N'
    usr_key = f"&key={census_id}"  # Put it all together in one f-string:
    query_2020census = f"{host}{year}{dataset_acronym_2020census}{g}{variables_2020census}{location}{usr_key}"  # Use requests package to call out to the API
    query_2020acs5 = f"{host}{year}{dataset_acronym_2020acs5}{g}{variables_2020acs5}{location}{usr_key}"
    query_2020dp = f"{host}{year}{dataset_acronym_2020dp}{g}{variables_2020dp}{location}{usr_key}"
    query_2020dhc = f"{host}{year}{dataset_acronym_2020dhc}{g}{variables_2020dhc}{location}{usr_key}"
    queries = [
        ("2020 Census", query_2020census),
        ("2020 ACS5", query_2020acs5),
        ("2020 DP", query_2020dp),
        ("2020 DHC", query_2020dhc),
    ]
    # Make API requests
    # Query and response handling for 2020 Census
    response_2020census = requests.get(query_2020census)
    if response_2020census.status_code == 200:
        try:
            alpha = response_2020census.text
            beta = json.loads(alpha)
            df_2020census = pd.DataFrame.from_dict(beta)
            df_2020census = df_2020census.T
        except Exception as e:
            print(f"Error processing 2020 Census data for GEOID {geoid}: {e}")
    else:
        print(f"Failed to fetch 2020 Census data for GEOID {geoid}: {response_2020census.status_code}")
    # Query and response handling for 2020 ACS5
    response_2020acs5 = requests.get(query_2020acs5)
    if response_2020acs5.status_code == 200:
        try:
            gamma = response_2020acs5.text
            delta = json.loads(gamma)
            df_2020acs5 = pd.DataFrame.from_dict(delta)
            df_2020acs5 = df_2020acs5.T
        except Exception as e:
            print(f"Error processing 2020 ACS5 data for GEOID {geoid}: {e}")
    else:
        print(f"Failed to fetch 2020 ACS5 data for GEOID {geoid}: {response_2020acs5.status_code}")
    # Query and response handling for 2020 DP
    response_2020dp = requests.get(query_2020dp)
    if response_2020dp.status_code == 200:
        try:
            epsilon = response_2020dp.text
            iota = json.loads(epsilon)
            df_2020dp = pd.DataFrame.from_dict(iota)
            df_2020dp = df_2020dp.T
        except Exception as e:
            print(f"Error processing 2020 DP data for GEOID {geoid}: {e}")
    else:
        print(f"Failed to fetch 2020 DP data for GEOID {geoid}: {response_2020dp.status_code}")
    # Query and response handling for 2020 DHC
    response_2020dhc = requests.get(query_2020dhc)
    if response_2020dhc.status_code == 200:
        try:
            theta = response_2020dhc.text
            zeta = json.loads(theta)
            df_2020dhc = pd.DataFrame.from_dict(zeta)
            df_2020dhc = df_2020dhc.T
        except Exception as e:
            print(f"Error processing 2020 DHC data for GEOID {geoid}: {e}")
    else:
        print(f"Failed to fetch 2020 DHC data for GEOID {geoid}: {response_2020dhc.status_code}")
    population = df_2020census[1][1]  # P1_001N
    population = float(population)
    if population < 25:
        return
    cityname = df_2020census[1][0]
    if "district" in cityname.lower() and "district of columbia" not in cityname.lower():
        return  # Skip this iteration of the loop
    city = process_place_string(cityname)
    writtendirectory = output_dir + '/{}'.format(state)
    if not os.path.exists(writtendirectory):
        os.makedirs(writtendirectory)
    numberwhite = df_2020census[1][2]  # P1_003N
    numberblack = df_2020census[1][3]  # P1_004N
    numbernative = df_2020census[1][4]  # P1_005N
    numberasian = df_2020census[1][5]  # P1_006N
    numberpacificislander = df_2020census[1][6]  # P1_007N
    numberotherrace = df_2020census[1][7]  # P1_008N
    numbertwoormorerace = df_2020census[1][8]  # P1_009N
    numberhispanic = df_2020census[1][9]  # P2_002N
    numbernonhispanicwhite = df_2020census[1][10]  # P2_005N
    popunder5 = df_2020dp[1][1]  # DP1_0002C
    pop5to9 = df_2020dp[1][2]  # DP1_0003C
    pop10to14 = df_2020dp[1][3]  # DP1_0004C
    pop15to19 = df_2020dp[1][4]  # DP1_0005C
    pop20to24 = df_2020dp[1][5]  # DP1_0006C
    pop25to29 = df_2020dp[1][6]  # DP1_0007C
    pop30to34 = df_2020dp[1][7]  # DP1_0008C
    pop35to39 = df_2020dp[1][8]  # DP1_0009C
    pop40to44 = df_2020dp[1][9]  # DP1_0010C
    pop45to49 = df_2020dp[1][10]  # DP1_0011C
    pop50to54 = df_2020dp[1][11]  # DP1_0012C
    pop55to59 = df_2020dp[1][12]  # DP1_0013C
    pop60to64 = df_2020dp[1][13]  # DP1_0014C
    pop65to69 = df_2020dp[1][14]  # DP1_0015C
    pop70to74 = df_2020dp[1][15]  # DP1_0016C
    pop75to79 = df_2020dp[1][16]  # DP1_0017C
    pop80to84 = df_2020dp[1][17]  # DP1_0018C
    pop85plus = df_2020dp[1][18]  # DP1_0019C
    popover18 = df_2020dp[1][19]  # DP1_0021C
    popunder5 = float(popunder5)
    pop5to9 = float(pop5to9)
    pop10to14 = float(pop10to14)
    pop15to19 = float(pop15to19)
    pop20to24 = float(pop20to24)
    pop25to29 = float(pop25to29)
    pop30to34 = float(pop30to34)
    pop35to39 = float(pop35to39)
    pop40to44 = float(pop40to44)
    pop45to49 = float(pop45to49)
    pop50to54 = float(pop50to54)
    pop55to59 = float(pop55to59)
    pop60to64 = float(pop60to64)
    pop65to69 = float(pop65to69)
    pop70to74 = float(pop70to74)
    pop75to79 = float(pop75to79)
    pop80to84 = float(pop80to84)
    pop85plus = float(pop85plus)
    popover18 = float(popover18)
    popunder18 = population - popover18
    # 18-24 = (all bands up through 15-19 plus 20-24) minus everyone under 18
    pop18to24 = pop20to24 + pop15to19 + popunder5 + pop5to9 + pop10to14 - popunder18
    pop25to44 = pop25to29 + pop30to34 + pop35to39 + pop40to44
    pop45to64 = pop45to49 + pop50to54 + pop55to59 + pop60to64
    pop65plus = pop65to69 + pop70to74 + pop75to79 + pop80to84 + pop85plus
    medianage = df_2020dp[1][20]  # DP1_0073C
    malepopulation = df_2020dp[1][21]  # DP1_0025C
    femalepopulation = df_2020dp[1][22]  # DP1_0049C
    femalepopulation18plus = df_2020dp[1][23]  # DP1_0069C
    malepopulation18plus = df_2020dp[1][24]  # DP1_0045C
    medianage = float(medianage)
    malepopulation = float(malepopulation)
    if malepopulation == 0:
        return
    femalepopulation = float(femalepopulation)
    if femalepopulation == 0:
        return
    femalepopulation18plus = float(femalepopulation18plus)
    malepopulation18plus = float(malepopulation18plus)
    if malepopulation18plus == 0:
        return
    if femalepopulation18plus == 0:
        return
    femaletomaleratio = (femalepopulation / malepopulation) * 100
    femaletomaleratio = round(femaletomaleratio, 1)
    femaletomaleratio18plus = (femalepopulation18plus / malepopulation18plus) * 100
    femaletomaleratio18plus = round(femaletomaleratio18plus, 1)
    marriedcouples = df_2020dp[1][25]  # DP1_0133C
    femalelivingalone = df_2020dp[1][26]  # DP1_0142C
    malelivingalone = df_2020dp[1][27]  # DP1_0138C
    malelivingalone65plus = df_2020dp[1][28]  # DP1_0139C
    femalelivingalone65plus = df_2020dp[1][29]  # DP1_0143C
    femalehouseholder = df_2020dp[1][30]  # DP1_0141C
    numberofhousingunits = df_2020dp[1][31]  # DP1_0147C
    totalhouseholds = df_2020dp[1][32]  # DP1_0132C
    under18households = df_2020dp[1][33]  # DP1_0145C
    avghouseholdsize = df_2020acs5[1][1]  # S1101_C01_002E
    avgfamilysize = df_2020acs5[1][2]  # S1101_C01_004E
    totalfamilies = df_2020dhc[1][1]  # P16_002N
    medianhouseholdincome = df_2020acs5[1][4]  # S1903_C03_001E
    medianhouseholdincomestd = df_2020acs5[1][5]  # S1903_C03_001M
    medianfamilyincome = df_2020acs5[1][6]  # S1903_C03_015E
    medianfamilyincomestd = df_2020acs5[1][7]  # S1903_C03_015M
    medianmaleincome = df_2020acs5[1][8]  # S2001_C03_002E
    medianmaleincomestd = df_2020acs5[1][9]  # S2001_C03_002M
    medianfemaleincome = df_2020acs5[1][10]  # S2001_C05_002E
    medianfemaleincomestd = df_2020acs5[1][11]  # S2001_C05_002M
    percapitaincome = df_2020acs5[1][12]  # S2001_C01_002E
    percapitaincomestd = df_2020acs5[1][13]  # S2001_C01_002M
    percentpovertyfamily = df_2020acs5[1][14]  # S1702_C02_001E
    percentpovertypopulation = df_2020acs5[1][15]  # S1701_C03_001E
    percentpoverty18 = df_2020acs5[1][16]  # S1701_C03_002E
    percentpoverty65 = df_2020acs5[1][17]  # S1701_C03_010E
    medianhouseholdincome = int(medianhouseholdincome)
    medianhouseholdincomestd = int(medianhouseholdincomestd)
    medianfamilyincome = int(medianfamilyincome)
    medianfamilyincomestd = int(medianfamilyincomestd)
    medianmaleincome = int(medianmaleincome)
    medianmaleincomestd = int(medianmaleincomestd)
    medianfemaleincome = int(medianfemaleincome)
    medianfemaleincomestd = int(medianfemaleincomestd)
    percapitaincome = int(percapitaincome)
    percapitaincomestd = int(percapitaincomestd)
    bachelordegrees18to24 = df_2020acs5[1][18]  # S1501_C01_005E
    bachelordegrees18to24 = float(bachelordegrees18to24)
    bachelordegrees25plus = df_2020acs5[1][19]  # S1501_C01_015E
    bachelordegrees25plus = float(bachelordegrees25plus)
    bachelordegreestotal = bachelordegrees18to24 + bachelordegrees25plus
    population = int(population)
    # so = wptools.page('{}, {}'.format(city, state)).get_parse()
    # infobox = so.data['infobox']
    areami = row['ALAND_SQMI']
    areami = float(areami)
    areakm = areami * 2.59
    populationdensitymi = population / areami
    populationdensitymi = round(populationdensitymi, 1)
    populationdensitykm = population / areakm
    populationdensitykm = round(populationdensitykm, 1)
    numberofhousingunits = int(numberofhousingunits)
    housingunitdensitymi = numberofhousingunits / areami
    housingunitdensitymi = round(housingunitdensitymi, 1)
    housingunitdensitykm = numberofhousingunits / areakm
    housingunitdensitykm = round(housingunitdensitykm, 1)
    numberwhite = int(numberwhite)
    numberblack = int(numberblack)
    numberasian = int(numberasian)
    numbernative = int(numbernative)
    numberpacificislander = int(numberpacificislander)
    numberotherrace = int(numberotherrace)
    numbertwoormorerace = int(numbertwoormorerace)
    numberhispanic = int(numberhispanic)
    numbernonhispanicwhite = int(numbernonhispanicwhite)
    percentwhite = 100 * (numberwhite / population)
    percentwhite = round(percentwhite, 2)
    percentblack = 100 * (numberblack / population)
    percentblack = round(percentblack, 2)
    percentasian = 100 * (numberasian / population)
    percentasian = round(percentasian, 2)
    percentnative = 100 * (numbernative / population)
    percentnative = round(percentnative, 2)
    percentpacific = 100 * (numberpacificislander / population)
    percentpacific = round(percentpacific, 2)
    percentotherraces = 100 * (numberotherrace / population)
    percentotherraces = round(percentotherraces, 2)
    percenttwoormoreraces = 100 * (numbertwoormorerace / population)
    percenttwoormoreraces = round(percenttwoormoreraces, 2)
    percenthispanic = 100 * (numberhispanic / population)
    percenthispanic = round(percenthispanic, 2)
    percentnonhispanicwhite = 100 * (numbernonhispanicwhite / population)
    percentnonhispanicwhite = round(percentnonhispanicwhite, 2)
    totalhouseholds = float(totalhouseholds)
    totalfamilies = float(totalfamilies)
    under18households = float(under18households)
    marriedcouples = float(marriedcouples)
    if marriedcouples <= 0:
        return
    percentmarriedcouples = 100 * (marriedcouples / totalhouseholds)
    percentmarriedcouples = round(percentmarriedcouples, 1)
    percentunder18households = 100 * (under18households / totalhouseholds)
    percentunder18households = round(percentunder18households, 1)
    malelivingalone = float(malelivingalone)
    femalelivingalone = float(femalelivingalone)
    femalehouseholder = float(femalehouseholder)
    percentfemalehouseholder = 100 * (femalehouseholder / totalhouseholds)
    percentfemalehouseholder = round(percentfemalehouseholder, 1)
    livingalone = malelivingalone + femalelivingalone
    percentlivingalone = 100 * (livingalone / totalhouseholds)
    percentlivingalone = round(percentlivingalone, 1)
    malelivingalone65plus = float(malelivingalone65plus)
    femalelivingalone65plus = float(femalelivingalone65plus)
    livingalone65plus = malelivingalone65plus + femalelivingalone65plus
    livingalone65plus = float(livingalone65plus)
    percentlivingalone65plus = 100 * (livingalone65plus / totalhouseholds)
    percentlivingalone65plus = round(percentlivingalone65plus, 1)
    avghouseholdsize = float(avghouseholdsize)
    avghouseholdsize = round(avghouseholdsize, 1)
    avgfamilysize = float(avgfamilysize)
    avgfamilysize = round(avgfamilysize, 1)
    percentpopunder18 = 100 * (popunder18 / population)
    percentpopunder18 = round(percentpopunder18, 1)
    percentpop18to24 = 100 * (pop18to24 / population)
    percentpop18to24 = round(percentpop18to24, 1)
    percentpop25to44 = 100 * (pop25to44 / population)
    percentpop25to44 = round(percentpop25to44, 1)
    percentpop45to64 = 100 * (pop45to64 / population)
    percentpop45to64 = round(percentpop45to64, 1)
    percentpop65plus = 100 * (pop65plus / population)
    percentpop65plus = round(percentpop65plus, 1)
    percentbachelordegrees = 100 * (bachelordegreestotal / population)
    percentbachelordegrees = round(percentbachelordegrees, 1)
    totalhouseholds = int(totalhouseholds)
    totalhouseholds = format(totalhouseholds, ",")
    population = format(population, ",")
    populationdensitymi = format(populationdensitymi, ",")
    populationdensitykm = format(populationdensitykm, ",")
    numberofhousingunits = format(numberofhousingunits, ",")
    numberwhite = format(numberwhite, ",")
    numberblack = format(numberblack, ",")
    numberasian = format(numberasian, ",")
    numbernative = format(numbernative, ",")
    numberpacificislander = format(numberpacificislander, ",")
    numberotherrace = format(numberotherrace, ",")
    numbertwoormorerace = format(numbertwoormorerace, ",")
    numberhispanic = format(numberhispanic, ",")
    numbernonhispanicwhite = format(numbernonhispanicwhite, ",")
    housingunitdensitykm = format(housingunitdensitykm, ",")
    housingunitdensitymi = format(housingunitdensitymi, ",")
    medianhouseholdincome = format(medianhouseholdincome, ",")
    medianfemaleincome = format(medianfemaleincome, ",")
    medianfemaleincomestd = format(medianfemaleincomestd, ",")
    percapitaincome = format(percapitaincome, ",")
    percapitaincomestd = format(percapitaincomestd, ",")
    medianhouseholdincomestd = format(medianhouseholdincomestd, ",")
    medianfamilyincome = format(medianfamilyincome, ",")
    medianfamilyincomestd = format(medianfamilyincomestd, ",")
    medianmaleincome = format(medianmaleincome, ",")
    medianmaleincomestd = format(medianmaleincomestd, ",")
    totalfamilies = int(totalfamilies)
    totalfamilies = format(totalfamilies, ",")
outputtextfilename = cityname
cityname = cityname.replace(" ","%20")
line23 = '===2020 census==='
line24 = '\n'
line1 = "The [[2020 United States census|2020 United States census]] counted %s people, %s households, and %s families " % (population, totalhouseholds, totalfamilies)
line2 = "in {}.<ref>{{{{Cite web |title=US Census Bureau, Table P16: HOUSEHOLD TYPE |url=https://data.census.gov/table?q={}%20p16&y=2020 |access-date={} |website=data.census.gov}}}}</ref><ref name="":0"" />".format(city,cityname,formatted_date)
line22 = " The population density was %s per square mile (%s/km{{sup|2}})." % (populationdensitymi, populationdensitykm)
line3 = " There were %s housing units at an average density of %s per square mile (%s/km{{sup|2}})." % (numberofhousingunits,housingunitdensitymi, housingunitdensitykm)
line21 = "<ref name="":0"">{{{{Cite web |title=US Census Bureau, Table DP1: PROFILE OF GENERAL POPULATION AND HOUSING CHARACTERISTICS |url=https://data.census.gov/table/DECENNIALDP2020.DP1?q={}%20dp1 |access-date={} |website=data.census.gov}}}}</ref><ref>{{{{Cite web |last=Bureau |first=US Census |title=Gazetteer Files |url=https://www.census.gov/geographies/reference-files/2020/geo/gazetter-file.html |access-date=2023-12-30 |website=Census.gov}}}}</ref> ".format(cityname,formatted_date)
line4 = "The racial makeup was {}% ({}) [[White (U.S. Census)|white]] or [[European American|European American]] ({}% [[Non-Hispanic White|non-Hispanic white]]), {}% ({}) [[African American (U.S. Census)|black]] or [[African American|African-American]], {}% ({}) [[Native American (U.S. Census)|Native American]] or [[Alaska Native|Alaska Native]], {}% ({}) [[Asian (U.S. Census)|Asian]], {}% ({}) [[Pacific Islander (U.S. Census)|Pacific Islander]] or [[Native Hawaiian|Native Hawaiian]], ".format(percentwhite,numberwhite,percentnonhispanicwhite,percentblack,numberblack,percentnative,numbernative,percentasian,numberasian,percentpacific,numberpacificislander)
line5 = "{}% ({}) from [[Race (United States Census)|other races]], and {}% ({}) from [[Multiracial Americans|two or more races]].<ref>{{{{Cite web |title=US Census Bureau, Table P1: RACE |url=https://data.census.gov/table/DECENNIALPL2020.P1?q={}%20p1&y=2020 |access-date={} |website=data.census.gov}}}}</ref> [[Hispanic (U.S. Census)|Hispanic]] or [[Latino (U.S. Census)|Latino]] of any race was {}% ({}) of the population.<ref>{{{{Cite web |title=US Census Bureau, Table P2: HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE |url=https://data.census.gov/table/DECENNIALPL2020.P2?q={}%20p2&y=2020 |access-date={} |website=data.census.gov}}}}</ref>".format(percentotherraces,numberotherrace,percenttwoormoreraces,numbertwoormorerace,cityname,formatted_date,percenthispanic,numberhispanic,cityname,formatted_date)
line6 = "\n"
line7 = "\n"
line8 = "Of the {} households, {}% had children under the age of 18; {}% were married couples living together; {}% had a female householder with no".format(totalhouseholds,percentunder18households,percentmarriedcouples, percentfemalehouseholder)
line9 = " spouse or partner present. {}% of households consisted of individuals and {}% had someone ".format(percentlivingalone,percentlivingalone65plus)
line10 = "living alone who was 65 years of age or older.<ref name=\":0\" /> The average household size was {} and the average family size was {}.<ref>{{{{Cite web |title=US Census Bureau, Table S1101: HOUSEHOLDS AND FAMILIES |url=https://data.census.gov/table/ACSST5Y2020.S1101?q={}%20s1101%20&y=2020 |access-date={} |website=data.census.gov}}}}</ref> An estimated {}% of the population had a bachelor’s degree or higher.<ref>{{{{Cite web |title=US Census Bureau, Table S1501: EDUCATIONAL ATTAINMENT |url=https://data.census.gov/table/ACSST5Y2020.S1501?q={}%20s1501%20&y=2020 |access-date={} |website=data.census.gov}}}}</ref>".format(avghouseholdsize,avgfamilysize,cityname,formatted_date,percentbachelordegrees,cityname,formatted_date)
line11 = "\n"
line12 = "\n"
line13 = "{}% of the population was under the age of 18, {}% from 18 to 24, {}% from 25 to 44, {}% from 45 to 64, and {}% were 65 years of age or older.".format(percentpopunder18,percentpop18to24,percentpop25to44,percentpop45to64,percentpop65plus)
line14 = " The median age was {} years. For every 100 females, there were {} males.<ref name=\":0\" /> For every 100 females ages 18 and older, there were {} males.<ref name=\":0\" />".format(medianage,femaletomaleratio,femaletomaleratio18plus)
line15 = "\n"
line16 = "\n"
# Define invalid values for income fields
invalid_values = {"-666,666,666", "-222,222,222", "-333,333,333","-666666666.0"}
# Determine the main combined text
if all(
value in invalid_values
for value in [
medianhouseholdincome,
medianhouseholdincomestd,
medianfamilyincome,
medianfamilyincomestd,
]
):
combined_text = (
"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that "
# "no valid income data is available.<ref>{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME "
# "IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={}%20s1903%20&y=2020 "
# "|access-date={} |website=data.census.gov}}</ref>".format(cityname, formatted_date)
)
else:
# Handle household income
if medianhouseholdincome not in invalid_values:
medianhouseholdincome_numeric = int(medianhouseholdincome.replace(",",""))
if medianhouseholdincomestd not in invalid_values:
household_income_text = (
"The median household income was ${} (with a margin of error of +/- ${}).".format(
medianhouseholdincome, medianhouseholdincomestd
)
)
elif medianhouseholdincome_numeric < 250001:
household_income_text = "The median household income was ${}.".format(
medianhouseholdincome
)
elif medianhouseholdincome_numeric >= 250001:
household_income_text = "The median household income was greater than $250,000."
else:
household_income_text = ""
# Handle family income
if medianfamilyincome not in invalid_values:
medianfamilyincome_numeric = int(medianfamilyincome.replace(",", ""))
if medianfamilyincomestd not in invalid_values:
family_income_text = (
" The median family income was ${} (+/- ${}).".format(
medianfamilyincome, medianfamilyincomestd
)
)
elif medianfamilyincome_numeric < 250001:
family_income_text = " The median family income was ${}.".format(
medianfamilyincome
)
elif medianfamilyincome_numeric >= 250001:
family_income_text = " The median family income was greater than $250,000."
else:
family_income_text = ""
if family_income_text != "" and household_income_text != "":
combined_text = (
f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {household_income_text}{family_income_text}"
f"<ref>{{{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
)
elif family_income_text != "" and household_income_text == "":
combined_text = (
f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {family_income_text}"
f"<ref>{{{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
)
elif family_income_text == "" and household_income_text != "":
combined_text = (
f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that {household_income_text}"
f"<ref>{{{{Cite web |title=US Census Bureau, Table S1903: MEDIAN INCOME IN THE PAST 12 MONTHS "
f"(IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S1903?q={cityname}%20s1903%20&y=2020 "
f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
)
elif family_income_text == "" and household_income_text == "":
combined_text = (
f"The 2016-2020 5-year [[American Community Survey|American Community Survey]] estimates show that "
)
# Gender income text
if medianmaleincome in invalid_values and medianfemaleincome in invalid_values:
gender_income_text = ""
elif medianmaleincome not in invalid_values and medianfemaleincome not in invalid_values:
if medianmaleincomestd in invalid_values and medianfemaleincomestd in invalid_values:
gender_income_text = " Males had a median income of ${} versus ${} for females.".format(
medianmaleincome, medianfemaleincome
)
elif medianmaleincomestd in invalid_values:
gender_income_text = " Males had a median income of ${} versus ${} (+/- ${}) for females.".format(
medianmaleincome, medianfemaleincome, medianfemaleincomestd
)
elif medianfemaleincomestd in invalid_values:
gender_income_text = " Males had a median income of ${} (+/- ${}) versus ${} for females.".format(
medianmaleincome, medianmaleincomestd, medianfemaleincome
)
else:
gender_income_text = " Males had a median income of ${} (+/- ${}) versus ${} (+/- ${}) for females.".format(
medianmaleincome, medianmaleincomestd, medianfemaleincome, medianfemaleincomestd
)
elif medianmaleincome in invalid_values:
if medianfemaleincomestd in invalid_values:
gender_income_text = " Females had a median income of ${}.".format(medianfemaleincome)
else:
gender_income_text = " Females had a median income of ${} (+/- ${}).".format(
medianfemaleincome, medianfemaleincomestd
)
elif medianfemaleincome in invalid_values:
if medianmaleincomestd in invalid_values:
gender_income_text = " Males had a median income of ${}.".format(medianmaleincome)
else:
gender_income_text = " Males had a median income of ${} (+/- ${}).".format(
medianmaleincome, medianmaleincomestd
)
# Per capita income text
per_capita_income_text = (
""
if percapitaincome in invalid_values
else " The median income for those ages 16 and older was ${} (+/- ${}).<ref>{{{{Cite web |title=US Census Bureau, Table S2001: "
"EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S2001?q={}%20s2001%20&y=2020 "
"|access-date={} |website=data.census.gov}}}}</ref>".format(
percapitaincome, percapitaincomestd, cityname, formatted_date
)
if percapitaincomestd not in invalid_values
else " The median income for those ages 16 and older was ${}.<ref>{{{{Cite web |title=US Census Bureau, Table S2001: "
"EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS) |url=https://data.census.gov/table/ACSST5Y2020.S2001?q={}%20s2001%20&y=2020 "
"|access-date={} |website=data.census.gov}}}}</ref>".format(
percapitaincome, cityname, formatted_date
)
)
# Poverty text
if all(
value in invalid_values
for value in [
percentpovertyfamily,
percentpovertypopulation,
percentpoverty18,
percentpoverty65,
]
):
poverty_text = "" # Exclude poverty text entirely if all values are invalid
else:
# Handle individual cases and permutations
family_text = (
f"{percentpovertyfamily}% of families"
if percentpovertyfamily not in invalid_values
else ""
)
population_text = (
f"{percentpovertypopulation}% of the population"
if percentpovertypopulation not in invalid_values
else ""
)
under_18_text = (
f"{percentpoverty18}% of those under the age of 18"
if percentpoverty18 not in invalid_values
else ""
)
over_65_text = (
f"{percentpoverty65}% of those ages 65 or over"
if percentpoverty65 not in invalid_values
else ""
)
# Combine valid components dynamically
main_components = [text for text in [family_text, population_text] if text]
main_text = " and ".join(main_components)
additional_components = [text for text in [under_18_text, over_65_text] if text]
additional_text = " and ".join(additional_components)
# Construct poverty_text dynamically
if main_text and additional_text:
poverty_text = (
f" Approximately {main_text} were below the [[poverty line]], including {additional_text}."
)
elif main_text:
poverty_text = f" Approximately {main_text} were below the [[poverty line]]."
else:
poverty_text = ""
# Append references for poverty data if there's any text
if poverty_text:
poverty_text += (
f"<ref>{{{{Cite web |title=US Census Bureau, Table S1701: POVERTY STATUS IN THE PAST 12 MONTHS |url=https://data.census.gov/table/ACSST5Y2020.S1701?q={cityname}%20s1701%20&y=2020 "
f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
f"<ref>{{{{Cite web |title=US Census Bureau, Table S1702: POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES |url=https://data.census.gov/table/ACSST5Y2020.S1702?q={cityname}%20s1702&y=2020 "
f"|access-date={formatted_date} |website=data.census.gov}}}}</ref>"
)
# Combine all texts
final_text = f"{combined_text}{gender_income_text}{per_capita_income_text}{poverty_text}"
# Apply corrections
final_text = fix_space_after_ref(final_text)
if "]]Estimates" in final_text:
final_text = final_text.replace("]]Estimates", "]] estimates")
# Fix "The" capitalization after "show that"
final_text = final_text.replace("show that The", "show that the")
# Fix double spaces
while " " in final_text:
final_text = final_text.replace(" ", " ")
# Fix percentages with extra spaces (e.g., "14. 7%" -> "14.7%")
final_text = re.sub(r"(\d)\.\s+(\d)", r"\1.\2", final_text)
# Trim spaces at the start and end of the text
final_text = final_text.strip()
# Fix spaces after </ref>
final_text = fix_space_after_ref(final_text)
line120 = line23+line24+line1+line2+line22+line3+line21+line4+line5+line6+line7+line8+line9+line10+line11+line12+line13+line14+line15+line16+final_text
with open(writtendirectory + '/%s_Demographics.txt' % (outputtextfilename), 'w+') as text_file:
print(f"{line120}", file=text_file)
print(f"Processing: {outputtextfilename}")
# Print remaining places
#print(total_places - i - 1, "places left")
def generate_demographics(census_id, gazette_file, output_dir, selected_states):
"""
Main function to generate demographics in parallel.
"""
today = date.today()
formatted_date = today.strftime("%m-%d-%Y")
# Load gazette data and filter by selected states
gazette_data = pd.read_csv(gazette_file, dtype=str)
selected_states = get_abbreviations_from_selected_states(selected_states)
filtered_df = gazette_data[gazette_data['USPS'].isin(selected_states)].copy()
# Define Central Time timezone
central_time = pytz.timezone('America/Chicago')
total_places = len(filtered_df)
# Estimate runtime, assuming roughly 0.5 to 0.7 seconds per place
lower_bound = total_places * 0.5
upper_bound = total_places * 0.7
# Get current time in Central Time
current_time_utc = datetime.now(pytz.utc) # Get current time in UTC
current_time_ct = current_time_utc.astimezone(central_time) # Convert to Central Time
# Calculate expected completion times in Central Time
completion_time_lower_ct = current_time_ct + timedelta(seconds=lower_bound)
completion_time_upper_ct = current_time_ct + timedelta(seconds=upper_bound)
print(f"Expected completion time range in Central Time: {completion_time_lower_ct} - {completion_time_upper_ct}")
num_chunks = max(1, min(10, len(filtered_df))) # At most 10 chunks; at least 1 so an empty DataFrame cannot cause division by zero
chunk_size = max(1, len(filtered_df) // num_chunks) # Ensure chunk size is at least 1
chunks = [filtered_df.iloc[i:i + chunk_size] for i in range(0, len(filtered_df), chunk_size)]
# Process chunks in parallel
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(lambda chunk: generate_demographics_for_chunk(chunk, total_places), chunks)
def get_state_name_from_fips(state_fips):
# Dictionary mapping StateFIPS codes to state names
fips_to_state_name = {
"01": "Alabama",
"02": "Alaska",
"04": "Arizona",
"05": "Arkansas",
"06": "California",
"08": "Colorado",
"09": "Connecticut",
"10": "Delaware",
"11": "District of Columbia",
"12": "Florida",
"13": "Georgia",
"15": "Hawaii",
"16": "Idaho",
"17": "Illinois",
"18": "Indiana",
"19": "Iowa",
"20": "Kansas",
"21": "Kentucky",
"22": "Louisiana",
"23": "Maine",
"24": "Maryland",
"25": "Massachusetts",
"26": "Michigan",
"27": "Minnesota",
"28": "Mississippi",
"29": "Missouri",
"30": "Montana",
"31": "Nebraska",
"32": "Nevada",
"33": "New Hampshire",
"34": "New Jersey",
"35": "New Mexico",
"36": "New York",
"37": "North Carolina",
"38": "North Dakota",
"39": "Ohio",
"40": "Oklahoma",
"41": "Oregon",
"42": "Pennsylvania",
"44": "Rhode Island",
"45": "South Carolina",
"46": "South Dakota",
"47": "Tennessee",
"48": "Texas",
"49": "Utah",
"50": "Vermont",
"51": "Virginia",
"53": "Washington",
"54": "West Virginia",
"55": "Wisconsin",
"56": "Wyoming",
}
# Return the state name from StateFIPS
return fips_to_state_name.get(state_fips, "State FIPS code not found")
def get_abbreviations_from_selected_states(selected_states):
# Dictionary mapping state names to abbreviations
state_to_abbreviation = {
"Alabama": "AL",
"Alaska": "AK",
"Arizona": "AZ",
"Arkansas": "AR",
"California": "CA",
"Colorado": "CO",
"Connecticut": "CT",
"Delaware": "DE",
"Florida": "FL",
"Georgia": "GA",
"Hawaii": "HI",
"Idaho": "ID",
"Illinois": "IL",
"Indiana": "IN",
"Iowa": "IA",
"Kansas": "KS",
"Kentucky": "KY",
"Louisiana": "LA",
"Maine": "ME",
"Maryland": "MD",
"Massachusetts": "MA",
"Michigan": "MI",
"Minnesota": "MN",
"Mississippi": "MS",
"Missouri": "MO",
"Montana": "MT",
"Nebraska": "NE",
"Nevada": "NV",
"New Hampshire": "NH",
"New Jersey": "NJ",
"New Mexico": "NM",
"New York": "NY",
"North Carolina": "NC",
"North Dakota": "ND",
"Ohio": "OH",
"Oklahoma": "OK",
"Oregon": "OR",
"Pennsylvania": "PA",
"Rhode Island": "RI",
"South Carolina": "SC",
"South Dakota": "SD",
"Tennessee": "TN",
"Texas": "TX",
"Utah": "UT",
"Vermont": "VT",
"Virginia": "VA",
"Washington": "WA",
"West Virginia": "WV",
"Wisconsin": "WI",
"Wyoming": "WY",
}
# Create a list of abbreviations for the selected states
abbreviations = [state_to_abbreviation.get(state, "State not found") for state in selected_states]
return abbreviations
def process_place_string(place_string):
# Skip placeholder entries for undefined subdivisions
if "County subdivisions not defined" in place_string or "Municipio subdivision not defined" in place_string:
return None
# Split the string into words
words = place_string.split()
# Identify the cutoff point
cutoff_index = 0
for i, word in enumerate(words):
# Check if the word is all caps (like an abbreviation)
if word.isupper():
break
# Check if the word starts with a capital letter
elif word[0].isupper():
cutoff_index = i + 1
else:
break
# Return the string up to the cutoff point
return " ".join(words[:cutoff_index])
def fix_space_after_ref(text):
"""
Adds a space after </ref> if the next character is not '<' or a space.
"""
corrected_text = ""
i = 0
while i < len(text):
if text[i:i+6] == "</ref>" and i+6 < len(text):
next_char = text[i+6]
if next_char != '<' and next_char != ' ':
corrected_text += "</ref> " # Add </ref> followed by a space
else:
corrected_text += "</ref>" # Keep </ref> as-is
i += 6 # Skip over "</ref>"
else:
corrected_text += text[i] # Add the current character
i += 1 # Move to the next character
return corrected_text
def correct_random_capitalization_and_fix_spaces(text):
"""
Corrects random capitalization and fixes spacing issues while preserving text within [[ ]] and <ref> </ref>.
Ensures proper formatting for inline phrases and removes extra spaces.
Args:
text (str): The input text with potentially incorrect capitalization and spacing.
Returns:
str: Corrected text.
"""
# Define patterns for preserving [[ ]] and <ref> tags
patterns_to_ignore = r'(\[\[.*?\]\])|(<ref>.*?</ref>)'
# Split text into parts to process or preserve
parts = re.split(patterns_to_ignore, text)
corrected_text = []
for part in parts:
if part is None:
continue
# Preserve parts within [[ ]] and <ref> as-is
if re.match(patterns_to_ignore, part):
corrected_text.append(part)
else:
# Remove double spaces and fix capitalization
cleaned_part = re.sub(r'\s{2,}', ' ', part.strip())
sentences = re.findall(r'[^.!?]*[.!?]?\s*', cleaned_part)
corrected_sentences = []
for i, sentence in enumerate(sentences):
stripped_sentence = sentence.strip()
if not stripped_sentence:
# Preserve empty spaces or breaks
corrected_sentences.append(sentence)
continue
# Check for inline continuation (e.g., "show that the")
if i > 0 and corrected_sentences[-1].strip().endswith(("that", "of", "for", "and", "or")):
corrected = stripped_sentence[0].lower() + stripped_sentence[1:]
else:
# Standard capitalization for new sentences
corrected = stripped_sentence[0].upper() + stripped_sentence[1:]
# Ensure specific terms retain proper casing (e.g., 'estimates')
corrected = corrected.replace("Estimates", "estimates")
corrected_sentences.append(corrected)
corrected_text.append(' '.join(corrected_sentences))
# Reassemble corrected parts and fix lingering double spaces
final_text = ''.join(corrected_text).strip()
final_text = re.sub(r'\s{2,}', ' ', final_text)
# Fix spacing around punctuation (e.g., "14. 7%")
final_text = re.sub(r'(\d)\.\s+(\d)', r'\1.\2', final_text)
return final_text
import time
tic = time.time()
generate_demographics(census_id, gazette_file, output_dir, selected_states)
toc = time.time()
print(toc-tic,'seconds elapsed')
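The character-by-character scan in fix_space_after_ref above can also be expressed as a single regular-expression substitution. The sketch below is only an illustration of that equivalence, not a replacement the tool uses:

```python
import re

def fix_space_after_ref(text):
    # Insert a space after </ref> whenever the next character
    # is neither '<' (another tag) nor already a space
    return re.sub(r"</ref>(?=[^< ])", "</ref> ", text)

print(fix_space_after_ref("</ref>The median age"))  # "</ref> The median age"
```

The lookahead `(?=[^< ])` leaves back-to-back references such as `</ref><ref>` and a trailing `</ref>` untouched, matching the behavior of the loop version.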
Cell 3
import shutil
import json
import os
from google.colab import files
# Load the input variables from the JSON file
temp_json_file = "/content/demographics_config.json" # Path to your JSON file
with open(temp_json_file, "r") as f:
config = json.load(f)
# Extract selected states from the JSON
selected_states = config.get("selected_states", [])
output_dir = config.get("output_dir", "/content/output")
# Loop through each selected state and create a ZIP file
for state in selected_states:
folder_to_download = os.path.join(output_dir, state)
# Check if the folder exists before zipping
if os.path.exists(folder_to_download):
output_zip_file = f"{folder_to_download}.zip"
# Compress the folder into a ZIP file
shutil.make_archive(output_zip_file.replace(".zip", ""), 'zip', folder_to_download)
# Download the ZIP file
files.download(output_zip_file)
else:
print(f"Folder for state '{state}' not found. Skipping.")
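shutil.make_archive appends the format's extension itself, which is why the loop above strips ".zip" from the base name before calling it. A small self-contained demonstration of that behavior (the folder and file names here are throwaway examples):

```python
import os
import shutil
import tempfile

# Build a throwaway state folder containing one output file
workdir = tempfile.mkdtemp()
folder = os.path.join(workdir, "TX")
os.makedirs(folder)
with open(os.path.join(folder, "Example_Demographics.txt"), "w") as f:
    f.write("sample output")

# make_archive takes the base name WITHOUT ".zip" and returns the archive's full path
archive_path = shutil.make_archive(folder, "zip", folder)
print(os.path.basename(archive_path))  # "TX.zip"
```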
- ^ McManus, Michael (January 22, 2022). "Using the U.S. Census Bureau API with Python".
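For readers following the McManus reference above: Census Bureau API requests are plain URLs assembled from get=, for=, in=, and key= query parameters. A minimal sketch of building one such URL (the helper name is my own; P1_001N is total population in the 2020 Decennial PL dataset, and the FIPS codes 48/05000 for Austin city, Texas are illustrative; substitute your own API key):

```python
def build_census_url(dataset, variables, place_fips, state_fips, api_key):
    # Assemble a Census Data API request: which variables to get,
    # for which geography, within which parent geography
    return (
        f"https://api.census.gov/data/{dataset}"
        f"?get={','.join(variables)}"
        f"&for=place:{place_fips}"
        f"&in=state:{state_fips}"
        f"&key={api_key}"
    )

url = build_census_url("2020/dec/pl", ["NAME", "P1_001N"], "05000", "48", "YOUR_KEY")
print(url)
```

Fetching the URL (e.g. with requests) returns a JSON array whose first row is the header and whose remaining rows are the values.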