Replit Agent Hates Large Datasets

I’ve been trying to complete a specific task on my tax prep application for the past three days.

The agent and Replit database are struggling to handle the large datasets.

Using Databricks or Snowflake for this purpose is risky until you’ve undergone formal training.

I’ve been stuck trying to load tax data from all 50 states in the United States.

Every time I attempt to load it, the agent responds with, “Agent encountered an error while running; we are investigating the issue.”

It simply can’t handle the data. I’ve officially reached the database limit! Which is exciting for the geek in me who was trying to push this project to its limits. However, it’s not ideal for the team lead in me, who is frustrated and throwing peanuts at the Replit logo from the peanut gallery.

I created a ticket.

I’m simply informing everyone that the native Replit database and agent are not equipped to integrate huge datasets like the tax data for all 50 states.

At least not without careful planning, using external database infrastructure, and then tailoring the prompts to a very specific level.

I spent approximately $5 to $10 in computation credits, and each time the agent hits a problem right after it reaches the Ohio or Texas counties, it returns the same error: “Agent encountered an error while running; we are investigating the issue.”

That appears to be the hard limit. I must admit, the geek in me understands that this essentially means Replit has a very high ceiling, but it does have one.

This amount of data would crash any agent, except perhaps Cursor, which I haven’t tested yet (I’m waiting for them to increase prices and change tiers after the next few funding rounds dry up).

To get a feel of what I’m trying to do:

1.) I’m making a GigPrep tax tool for US citizens: those who receive 1099 forms and have to file self-employment taxes (on top of, or instead of, a regular 1040).

2.) I compiled a CSV document containing all 50 states’ tax websites using AI. Then, I requested the agent to utilize this list to update the jurisdiction section of my application.

3.) I used the IRS website for federal tax purposes.

4.) I created a wizard for deductions that assists users in curating a formal 1099 itemized deductions sheet.

5.) The jurisdiction section failed to properly integrate all the counties within each of the 50 states. This is a substantial amount of data, and many .gov websites prohibit scraping.

6.) This discrepancy causes issues in the wizard tool when a user attempts to use their zip code or county to obtain the most specific information.

7.) There are limited up-to-date open-source tax software solutions available. You must rely on the actual state or government websites and the most recent information as the source of truth. Older software is useful for logic but not for accuracy. These changes occur constantly at the federal, state, and county levels.

The tool will be used for several years for my gigs and will eventually be integrated if I decide to formalize it.

I hope to eventually open it up and provide it to other gig workers.

This is NOT a tax filer like TurboTax or H&R Block.

This IS a tax preparation tool that generates an organized PDF file for your CPA to quickly review, verify, and incorporate into your taxes.

It is solely designed to assist in preparing and organizing for filing 1099 self-employment taxes.

It provides a curated document containing all the information in an extremely organized and easily digestible format.

If anyone reading this is curious about the task item that causes this error and crashes the agent, I’ll paste it below:

(It just spins its wheels. AI Quinn gave basic feedback. Might be a git commit issue persisting.)

I’ve done:

kill 1 

Several times. No luck.

Might be project disk space so I’ll use:

du -sh .

du -sh .git
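If you’d rather check sizes from inside Python than the shell, something like this approximates `du -s` (apparent size, so numbers can differ slightly from `du`’s block counts):

```python
import os

def dir_size_bytes(root: str) -> int:
    """Sum apparent file sizes under `root`, roughly what `du -s` reports."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, fname))
            except OSError:
                pass  # broken symlinks, permission errors, etc.
    return total

def human(n: float) -> str:
    """Format a byte count the way `du -sh` does (binary units)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TiB"

if __name__ == "__main__":
    for path in (".", ".git"):
        if os.path.isdir(path):
            print(path, human(dir_size_bytes(path)))
```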

Git history may be part of it. However, I think this is from me hitting the agent ceiling.

I’ve been forcing it to collect, parse, and incorporate the federal information along with all 50 states’ counties (approx. 3,143 of them, each with its own rules, regulations, and specific state site; most hate being scraped and parsed).

I think this is the “natural ceiling” of the Replit agent (for now).

It’s pretty high; I mean, the task was monumental and it did a decent job… until it didn’t, lol.

Drop Down For Task Item

#1 - Jurisdiction Service Completion

What & Why

zip_jurisdiction_resolver.py currently imports _derive_county_from_place and _build_jurisdiction_object from app.services.manual_jurisdiction_resolver — a module that doesn’t exist yet. This import error breaks the ZIP resolver at runtime. All the logic it needs is currently living inside app/routes/jurisdiction.py as private inline functions, making it untestable and tightly coupled to the route layer.

This task extracts that logic into a proper service module, expands the county data to full coverage for all relevant states, and adds two missing routes the frontend needs.

Done looks like

  • app/services/manual_jurisdiction_resolver.py exists with these public functions: resolve_jurisdiction_from_manual_selection(), fetch_counties_for_state(), fetch_places_for_county(), _derive_county_from_place(), _build_jurisdiction_object(), _get_state_tax_rules(), _get_county_tax_rules(), _get_municipal_rules()
  • app/services/zip_jurisdiction_resolver.py imports from the new module cleanly with no errors
  • GET /jurisdiction/places-for-county?state_code=OH&county_name=Franklin+County works and returns a list of place name strings
  • GET /jurisdiction/summary returns the current user’s saved jurisdiction from their profile
  • fetch_counties_for_state("OH") returns all 88 Ohio counties; fetch_counties_for_state("TX") returns all 254 Texas counties
  • app/routes/jurisdiction.py delegates to the two service modules instead of containing jurisdiction logic inline
  • The app starts, /health returns 200, and both the ZIP-resolver and manual-resolver endpoints return correctly-structured JSON

Out of scope

  • ZCTA crosswalk file (hand-coded ZIP table in zip_jurisdiction_resolver.py is sufficient)
  • Ohio school district crosswalk file
  • Tests (separate task)

Tasks

  1. Create manual_jurisdiction_resolver.py — Move _derive_county_from_place(), _build_jurisdiction_object(), _get_state_tax_rules(), _get_county_tax_rules(), _get_municipal_rules(), and _get_counties_for_state() from app/routes/jurisdiction.py into the new service. Add resolve_jurisdiction_from_manual_selection() as a thin public wrapper around _build_jurisdiction_object(), and add fetch_places_for_county() which maps a (state_code, county_name) pair to a list of known place names using the same lookup tables already present in _derive_county_from_place().
  2. Expand county data — Replace the partial _get_counties_for_state() county lists with complete lists: all 88 Ohio counties, all 254 Texas counties, and complete lists for every state that has local income taxes (MD, IN, PA, KY, MI, NY). Use authoritative county names that match the keys already present in local_income_taxes.py and _get_county_tax_rules(). Return results sorted alphabetically.
  3. Update app/routes/jurisdiction.py — Replace all inline private functions with imports from the two service modules. Add GET /jurisdiction/places-for-county?state_code=&county_name= (returns {"success": true, "data": {"places": [...]}}) and GET /jurisdiction/summary (returns the jurisdiction_data from the authenticated user’s profile, or an empty object if not yet resolved).
  4. Verify — Confirm python -c "from app.services.zip_jurisdiction_resolver import resolve_jurisdiction_from_zip" imports cleanly, and confirm the /jurisdiction/states/OH/counties endpoint returns all 88 counties.

Relevant files

  • app/services/zip_jurisdiction_resolver.py:177-208
  • app/routes/jurisdiction.py:215-747
  • app/data/local_income_taxes.py
  • app/data/tax_jurisdictions.py
  • app/utils/constants.py
  • app/utils/helpers.py

I use large files on mine. How large are yours (rows and columns), and what file size? I’m moving a few million records a day, but you might be doing more if you’re parsing government data. I load my CSV into memory and parse it in chunks. Neon has a limit on how much data can be pushed to the database, so you will need to account for that. Give me some more details and I can try to help. You don’t want this parsed as tokens in the agent.

It’s hard to tell with this one what underlying issue is causing the agent to crash.

Here are PNGs of the compiled tax links per state. The federal information is easy…

IRS is one Source of Truth.

Individual state jurisdiction is what causes the agent to break.

It’s either one (or many) gov links not liking being scraped, or the sheer magnitude of data.

If you look at the GigPrep tool I just posted, that is what I was working on when it crashed. I went ahead and tried doing just one state, and even that alone was giving me problems.

I’m not asking for it to just scrape basic info. I’m asking it to scrape & parse accurate tax information per state, then per county.

I’m trying to figure out the best way to get all 50 states (federal and state taxes) into a single database that stays updated and points to the individual state sources.

Next year, or even throughout the year, these sites update to reflect changes, meaning the app will need to either be updated once a quarter or find a way to point to the individual state data in real time…

I’m running into the natural wall of Gov sites being protected and not liking me doing this…

It just crashed trying to do it again, lol. STG this is the limit.

It refuses to budge past this:

T001: Expand county lists in manual_jurisdiction_resolver.py

  • Blocked By:
  • Details:
    • Replace partial OH county list with all 88 Ohio counties
    • Replace partial TX county list with all 254 Texas counties
    • Expand county lists for all states in the local income tax dataset (IN all 92, MD all 24, etc.)
    • Add fetch_places_for_county() function
    • Files: app/services/manual_jurisdiction_resolver.py

It can’t handle processing all the states, and their counties…

Even one at a time it struggled…

I’ve approached this from several angles today, then I paced around looking at it, I walked away from it, I worked on similar but different projects.

No luck. Just wasting computation & brain power.

I really don’t want to do it one state at a time; given the time and cost it took to secure just one, I still don’t know whether it pulled everything, or whether it’s accurate…

This is a low-risk build to get my feet wet, so to speak, and well… my feet are soaked. In fact, I slid from feet-wet right into knee/waist deep.

I gotta stop now or I’ll be shot mentally and enter waters I’m not prepared for tonight…

I’m going to have to sleep on it.

This is how much it costs to crash the agent. I cannot keep eating that cost.

My largest test database with a Replit project is 150 GiB. Replit is doing alright, but I’m using AWS and Replit together on that one.

It might do you well to have Replit use Supabase, Railway, or AWS if it’s struggling.

Happy to help you think through the problem and your options.


I can see where you are headed with this, but don’t fight the agent on it. The county data already exists as structured datasets from Census.gov (FIPS codes). Pull the CSV, write a small ingestion script that builds your lookup dictionaries, and let the agent write that instead of trying to generate thousands of county names from memory. The agent’s great at writing the 30-line loader script; it’s terrible at being a database. It’s the same approach I use for large catalog imports on my project: dumb data from static files (or GraphQL), parsed into the format I need and loaded into our database.

Grab this CSV that has every US county with FIPS codes and state abbreviations already mapped: fips-codes/state_and_county_fips_master.csv at master · kjhealy/fips-codes · GitHub

Or hit the Census API directly, one call gets you every county in the country:

curl "https://api.census.gov/data/2020/dec/dp?get=NAME&for=county:*"

Then have your agent write a small Python script that reads that CSV/JSON and builds your county dictionaries in manual_jurisdiction_resolver.py programmatically. That’s like a 30-line script, well within what any agent can handle in one pass.
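The JSON half is simple, too: the Census API returns a JSON array of arrays with the header in row 0. A sketch (the URL is reused from the curl example above, not independently verified here):

```python
import json
from urllib.request import urlopen

CENSUS_URL = "https://api.census.gov/data/2020/dec/dp?get=NAME&for=county:*"

def parse_census_counties(payload: list[list[str]]) -> list[dict]:
    """Census API payloads are arrays of arrays; row 0 holds column names."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

def fetch_census_counties(url: str = CENSUS_URL) -> list[dict]:
    # One call returns every county in the country.
    with urlopen(url, timeout=30) as resp:
        return parse_census_counties(json.load(resp))
```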


After generating the counties and locations for all 50 states, I can finally point the source to the main source of truth!

Yes, that’s it!

The locations don’t change much, and the counties remain the same. I can extract that information to get the correct individual counties and location details, which I can then input.

Once the data is organized with all the accurate locations, it will be easier to point to the source of truth.

I understand exactly what you mean. Yes, that’s right! :raising_hands:

It’s nearly 2 am, and I’ve been banging my head against a wall for 12+ hours on this problem, but finally, yes!

You’re right. That’s what caused the agent to crash.

I can definitely extract the data from there as a CSV file, just for jurisdictions.

Accurate tax information websites can then be added once per state, and I can have it log the data into the already existing counties after I’ve baked them in.

You’re right, the agent can’t manually handle the large dataset. Haha :joy:

That’s such a clutch workaround. That’s exactly what I’ll do tomorrow.


Been there, done that. I realized the context windows get blown out even in Opus or any of the newer models; going old school fixes that problem. Keep us posted on progress!


What type of dataset is it?


This was the original build document set if curious.

I went hard.

I think you might be unintentionally giving too much context for the agent to keep track of everything you are asking it to do. I highly recommend asking Replit to grade your prompts and documentation and propose an iteration that it can use as an aid. Prime it with the intent you have behind the artifacts, so it can calibrate its iteration. You will likely have better and more consistent results.


I keep circling back to a question that nobody in the agent building space seems willing to answer directly:

How much context is too much context?

Most people hand wave it away, usually because they are protecting a reputation or avoiding the messy truth. But as developers with a deeper understanding of the nuances, we know how deeply layered software really is.

We also know that every model handles context differently, and the documentation around those limits is vague at best.

I am naturally verbose because my brain works at a systems level, so reading the room with an LLM can be tricky. The real challenge is not writing the context. It is figuring out what the agent can actually use.

My Approach for This Build

I decided to merge two of my existing applications, Vroom Vroom and MoneyHungry, into a single Gig Prep Tool. To keep myself sane, I approached it in layers, mirroring how real software is structured.

1. Write the spec docs first

I documented every subsystem, behavior, and requirement before touching any prompts.

2. Build the Full Build prompt

This prompt aggregates all specs and tells the agent exactly what the system should do, how it should behave, and what constraints matter.

3. Write the Starter prompt last

This is the high level orientation prompt, the entry point that sets the tone and gives the agent its initial framing.

Execution Flow

I scaffolded a blank project with a prompt directory and loaded all the documents. Then I had the Replit agent refine each file, 15 to 16 documents total, adjusting tone, structure, and clarity. Those refined versions are what I committed to GitHub.

Next, I created a fresh blank project, scaffolded it the same way, and uploaded the refined files into a clean environment.

The run sequence looked like this:

  1. Load the Starter prompt into the main chat window
    Converted to a text file
    Instructed the agent to read the full document set

  2. Load the Full Build prompt
    Ensured the agent had the entire spec in memory

  3. Run the Starter prompt
    This kicked off the actual build process using the combined context

The Starter prompt gives the agent orientation.
The Full Build prompt gives it the complete blueprint.
Together, they force the agent to operate within the system I designed.

Why This Layered Approach Matters

Software is layered. Prompts should be too.
The real question is not whether you can give an agent this much context. It is whether the agent can consume it meaningfully.

Right now, that answer varies wildly between models, and almost nobody is talking about it honestly.

I have found that engineering intent can be more powerful than engineering context.

“Did we do anything that would make it difficult for future parallel agents to swarm?”

With this one line, I can get far better code quality than I would if I gave 100 explicit instructions. This is part of my retrospective prompt that I run after every major feature implementation. It creates a backlog of cleanup tasks that make sure everything is tidy and organized in my codebase.

When the agent understands the intent behind the task, it does an incredible job of synthesizing information and proceeding appropriately.

When the agent has multiple instructions, it limits its ability to perform that synthesis effectively.


I switch between heavy context and selective context depending on the task. For vendor-specific, highly detailed, mission-critical tasks that need very controlled constraints, I write a fully detailed task plus a document detailing how the engineering should look. For less mission-critical tasks, I give the agent a little architectural freedom, with the system documentation as a reference, so it can engineer around my architecture; that keeps things from breaking and limits the number of bugs. It’s still an art more than a science.


@jefftj86 I made this script using the Census data CSV.

I’ll let you know how the agent takes it. I’m taking your and @mikeamancuso’s feedback, and I’m pacing around refining my prompt to the agent. This one has to be surgical.

I’m hoping that once I scaffold the counties, the agent will be able to take on the 50 state sites.

I may need to go further with the site collection scraping though and have Claude pull all of the counties direct links.

I’d prefer to do it once so that in 2026 this automatically updates from the actual sites, which makes pointing to the source of truth the real game changer for longevity.

I mean, if a few change their URL, NBD; I just don’t want to have to upload data again next year. I want this sucker to pull it automatically moving forward.

I’m not planning to generate revenue from this, by the way. I’m actually going to use it for myself and my family members over several years, so it’s not just a learning project. I’m genuinely going to use this tool.

That said, I might want to self-host and run my own database when using my own real data. This way, Replit won’t be able to store and have access to all my information, haha. I think it could be useful for other gig workers though.

Regardless, I knew I needed this tool, and this was an excellent way to immerse myself in government data and discover the best ways to integrate it so that I can complete all my learning now.

I’m ranting so…

Here’s the Claude code and slop post:

Tomorrow I’m going to get back into it. Took the day off to get out “Slop Talk”

GigPrep County Ingest Script (ingest_counties.py)

Seeds the GigPrep SQLite database with all 50 US states + DC and their 3,143 counties, sourced directly from the Census Bureau FIPS file (united-states.xlsx).


What it does

  • Reads united-states.xlsx (3 columns: FIPS code, geographic name, Census RAE report URL)
  • Splits rows automatically — short FIPS (1–2 digits) = state, long FIPS (4–5 digits) = county
  • Creates two tables: states (51 rows) and counties (3,143 rows)
  • Stores the Census RAE report URL per county in rae_url — ready to use as a per-county source of truth
  • Idempotent — safe to re-run, duplicates are skipped via INSERT OR IGNORE
  • Works with SQLAlchemy (Flask stack) or falls back to raw sqlite3

Requirements

pip install pandas openpyxl sqlalchemy

Usage

# Default — looks for united-states.xlsx in project root, writes to instance/app.db
python ingest_counties.py

# Custom paths
python ingest_counties.py --xlsx path/to/united-states.xlsx
python ingest_counties.py --xlsx united-states.xlsx --db sqlite:///instance/app.db

Database schema

states

| column | type | notes |
| --- | --- | --- |
| id | INTEGER | PK |
| fips | TEXT | zero-padded, e.g. "01" |
| abbr | TEXT | e.g. "AL" |
| name | TEXT | e.g. "Alabama" |
| rae_url | TEXT | Census RAE state report link |
| created_at | DATETIME | |

counties

| column | type | notes |
| --- | --- | --- |
| id | INTEGER | PK |
| fips_full | TEXT UNIQUE | 5-digit, e.g. "01001" |
| fips_state | TEXT | e.g. "01" |
| fips_county | TEXT | e.g. "001" |
| name | TEXT | e.g. "Autauga County" |
| full_name | TEXT | e.g. "Autauga County, Alabama" |
| state_abbr | TEXT | e.g. "AL" |
| state_name | TEXT | e.g. "Alabama" |
| rae_url | TEXT | Census RAE county report link |
| created_at | DATETIME | |

Full script

"""
ingest_counties.py  —  GigPrep
================================
Reads united-states.xlsx (Census Bureau FIPS + RAE Report Links)
and seeds two tables into the GigPrep SQLite database:

    states   — 51 rows  (50 states + DC)
    counties — 3,143 rows

Source file columns:
    FIPS State and County Codes  |  Geographic area name  |  RAE Report Links

State rows  : FIPS is 1–2 digits  (e.g.  1  → Alabama)
County rows : FIPS is 4–5 digits  (e.g.  1001 → Autauga County, Alabama)

USAGE
-----
    python ingest_counties.py
    python ingest_counties.py --xlsx path/to/united-states.xlsx
    python ingest_counties.py --xlsx united-states.xlsx --db sqlite:///instance/app.db

REQUIREMENTS
------------
    pip install pandas openpyxl sqlalchemy
"""

import argparse
import os
import re
import sys
from datetime import datetime

import pandas as pd

# ---------------------------------------------------------------------------
# Optional SQLAlchemy — falls back to raw sqlite3
# ---------------------------------------------------------------------------
try:
    from sqlalchemy import (
        Column, Integer, String, DateTime, UniqueConstraint,
        create_engine,
    )
    from sqlalchemy.orm import DeclarativeBase, Session
    SQLALCHEMY = True
except ImportError:
    import sqlite3
    SQLALCHEMY = False

# ---------------------------------------------------------------------------
# Defaults
# ---------------------------------------------------------------------------
DEFAULT_XLSX = "united-states.xlsx"
DEFAULT_DB   = os.environ.get("DATABASE_URL", "sqlite:///instance/app.db")

# ---------------------------------------------------------------------------
# FIPS → (abbreviation, full state name)
# ---------------------------------------------------------------------------
FIPS_TO_STATE = {
    "01": ("AL", "Alabama"),        "02": ("AK", "Alaska"),
    "04": ("AZ", "Arizona"),        "05": ("AR", "Arkansas"),
    "06": ("CA", "California"),     "08": ("CO", "Colorado"),
    "09": ("CT", "Connecticut"),    "10": ("DE", "Delaware"),
    "11": ("DC", "District of Columbia"),
    "12": ("FL", "Florida"),        "13": ("GA", "Georgia"),
    "15": ("HI", "Hawaii"),         "16": ("ID", "Idaho"),
    "17": ("IL", "Illinois"),       "18": ("IN", "Indiana"),
    "19": ("IA", "Iowa"),           "20": ("KS", "Kansas"),
    "21": ("KY", "Kentucky"),       "22": ("LA", "Louisiana"),
    "23": ("ME", "Maine"),          "24": ("MD", "Maryland"),
    "25": ("MA", "Massachusetts"),  "26": ("MI", "Michigan"),
    "27": ("MN", "Minnesota"),      "28": ("MS", "Mississippi"),
    "29": ("MO", "Missouri"),       "30": ("MT", "Montana"),
    "31": ("NE", "Nebraska"),       "32": ("NV", "Nevada"),
    "33": ("NH", "New Hampshire"),  "34": ("NJ", "New Jersey"),
    "35": ("NM", "New Mexico"),     "36": ("NY", "New York"),
    "37": ("NC", "North Carolina"), "38": ("ND", "North Dakota"),
    "39": ("OH", "Ohio"),           "40": ("OK", "Oklahoma"),
    "41": ("OR", "Oregon"),         "42": ("PA", "Pennsylvania"),
    "44": ("RI", "Rhode Island"),   "45": ("SC", "South Carolina"),
    "46": ("SD", "South Dakota"),   "47": ("TN", "Tennessee"),
    "48": ("TX", "Texas"),          "49": ("UT", "Utah"),
    "50": ("VT", "Vermont"),        "51": ("VA", "Virginia"),
    "53": ("WA", "Washington"),     "54": ("WV", "West Virginia"),
    "55": ("WI", "Wisconsin"),      "56": ("WY", "Wyoming"),
}


# ===========================================================================
# SCHEMA  (SQLAlchemy)
# ===========================================================================
if SQLALCHEMY:
    class Base(DeclarativeBase):
        pass

    class State(Base):
        __tablename__ = "states"
        id         = Column(Integer, primary_key=True)
        fips       = Column(String(2),  nullable=False, unique=True)
        abbr       = Column(String(2),  nullable=False, unique=True)
        name       = Column(String(64), nullable=False, unique=True)
        rae_url    = Column(String(512))
        created_at = Column(DateTime, default=datetime.utcnow)

    class County(Base):
        __tablename__ = "counties"
        id          = Column(Integer, primary_key=True)
        fips_full   = Column(String(5),   nullable=False, unique=True)  # "01001"
        fips_state  = Column(String(2),   nullable=False)               # "01"
        fips_county = Column(String(3),   nullable=False)               # "001"
        name        = Column(String(128), nullable=False)               # "Autauga County"
        full_name   = Column(String(192), nullable=False)               # "Autauga County, Alabama"
        state_abbr  = Column(String(2),   nullable=False)
        state_name  = Column(String(64),  nullable=False)
        rae_url     = Column(String(512))
        created_at  = Column(DateTime, default=datetime.utcnow)

        __table_args__ = (
            UniqueConstraint("fips_full", name="uq_county_fips"),
        )


# ===========================================================================
# PARSE XLSX
# ===========================================================================

def load_xlsx(path: str) -> tuple[list[dict], list[dict]]:
    """
    Returns (state_records, county_records) parsed from the Census xlsx.
    State rows  have FIPS length <= 2 (e.g. "1", "2" ... "56").
    County rows have FIPS length >= 4 (e.g. "1001", "56045").
    """
    df = pd.read_excel(path, dtype=str)
    df.columns = ["fips_raw", "geo_name", "rae_url"]
    df = df.fillna("")
    df["fips_raw"] = df["fips_raw"].str.strip()

    states:   list[dict] = []
    counties: list[dict] = []

    for _, row in df.iterrows():
        raw      = row["fips_raw"]
        geo_name = row["geo_name"].strip()
        rae_url  = row["rae_url"].strip()
        is_state = len(raw) <= 2

        if is_state:
            fips_s = raw.zfill(2)
            abbr, full_name = FIPS_TO_STATE.get(fips_s, ("??", geo_name))
            states.append({
                "fips":    fips_s,
                "abbr":    abbr,
                "name":    full_name,
                "rae_url": rae_url,
            })
        else:
            fips_padded = raw.zfill(5)
            fips_state  = fips_padded[:2]
            fips_county = fips_padded[2:]
            abbr, s_name = FIPS_TO_STATE.get(fips_state, ("??", "Unknown"))
            # Strip ", StateName" suffix to get just the county name
            county_name = re.sub(r",\s*.+$", "", geo_name).strip()
            counties.append({
                "fips_full":   fips_padded,
                "fips_state":  fips_state,
                "fips_county": fips_county,
                "name":        county_name,
                "full_name":   geo_name,
                "state_abbr":  abbr,
                "state_name":  s_name,
                "rae_url":     rae_url,
            })

    return states, counties


# ===========================================================================
# INGEST — SQLAlchemy
# ===========================================================================

def ingest_sqlalchemy(states: list[dict], counties: list[dict], db_url: str) -> None:
    print(f"[ingest] SQLAlchemy  →  {db_url}")

    if db_url.startswith("sqlite:///"):
        db_file = db_url[len("sqlite:///"):]
        os.makedirs(os.path.dirname(db_file) or ".", exist_ok=True)

    engine = create_engine(db_url, echo=False)
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        s_inserted = s_skipped = 0
        for rec in states:
            if session.query(State).filter_by(fips=rec["fips"]).first():
                s_skipped += 1
                continue
            session.add(State(**rec))
            s_inserted += 1
        session.commit()
        print(f"[ingest] States  — inserted: {s_inserted}  skipped: {s_skipped}")

        c_inserted = c_skipped = 0
        for rec in counties:
            if session.query(County).filter_by(fips_full=rec["fips_full"]).first():
                c_skipped += 1
                continue
            session.add(County(**rec))
            c_inserted += 1
            if c_inserted % 500 == 0:
                session.commit()
                print(f"[ingest]   … {c_inserted} counties committed")

        session.commit()
        print(f"[ingest] Counties — inserted: {c_inserted}  skipped: {c_skipped}")


# ===========================================================================
# INGEST — raw sqlite3 fallback
# ===========================================================================

def ingest_sqlite3(states: list[dict], counties: list[dict], db_path: str) -> None:
    import sqlite3 as _sqlite3
    print(f"[ingest] sqlite3  →  {db_path}")
    os.makedirs(os.path.dirname(db_path) or ".", exist_ok=True)
    con = _sqlite3.connect(db_path)
    cur = con.cursor()

    cur.executescript("""
        CREATE TABLE IF NOT EXISTS states (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            fips       TEXT NOT NULL UNIQUE,
            abbr       TEXT NOT NULL UNIQUE,
            name       TEXT NOT NULL UNIQUE,
            rae_url    TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS counties (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            fips_full   TEXT NOT NULL UNIQUE,
            fips_state  TEXT NOT NULL,
            fips_county TEXT NOT NULL,
            name        TEXT NOT NULL,
            full_name   TEXT NOT NULL,
            state_abbr  TEXT NOT NULL,
            state_name  TEXT NOT NULL,
            rae_url     TEXT,
            created_at  TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    con.commit()

    s_inserted = s_skipped = 0
    for rec in states:
        cur.execute(
            "INSERT OR IGNORE INTO states (fips, abbr, name, rae_url) VALUES (?,?,?,?)",
            (rec["fips"], rec["abbr"], rec["name"], rec["rae_url"])
        )
        s_inserted += cur.rowcount
        s_skipped  += 1 - cur.rowcount
    con.commit()
    print(f"[ingest] States  — inserted: {s_inserted}  skipped: {s_skipped}")

    c_inserted = c_skipped = 0
    for i, rec in enumerate(counties, start=1):
        cur.execute(
            """INSERT OR IGNORE INTO counties
               (fips_full, fips_state, fips_county, name, full_name,
                state_abbr, state_name, rae_url)
               VALUES (?,?,?,?,?,?,?,?)""",
            (rec["fips_full"], rec["fips_state"], rec["fips_county"],
             rec["name"], rec["full_name"], rec["state_abbr"],
             rec["state_name"], rec["rae_url"])
        )
        c_inserted += cur.rowcount
        c_skipped  += 1 - cur.rowcount
        # Commit on the processed-row count, not c_inserted: a run of skipped
        # records right after a multiple of 500 would otherwise re-trigger
        # this commit/print on every iteration.
        if i % 500 == 0:
            con.commit()
            print(f"[ingest]   … {i} counties committed")

    con.commit()
    con.close()
    print(f"[ingest] Counties — inserted: {c_inserted}  skipped: {c_skipped}")


# ===========================================================================
# ENTRY POINT
# ===========================================================================

def main():
    parser = argparse.ArgumentParser(
        description="Seed GigPrep states + counties from Census XLSX."
    )
    parser.add_argument("--xlsx", default=DEFAULT_XLSX,
                        help=f"Path to united-states.xlsx (default: {DEFAULT_XLSX})")
    parser.add_argument("--db", default=DEFAULT_DB,
                        help=f"SQLAlchemy DB URL or sqlite path (default: {DEFAULT_DB})")
    args = parser.parse_args()

    if not os.path.exists(args.xlsx):
        sys.exit(
            f"[ingest] ERROR: File not found — {args.xlsx}\n"
            f"  Place united-states.xlsx in the project root and retry."
        )

    print(f"[ingest] Reading {args.xlsx} …")
    states, counties = load_xlsx(args.xlsx)
    print(f"[ingest] Parsed  {len(states)} states  +  {len(counties)} counties")

    if SQLALCHEMY:
        ingest_sqlalchemy(states, counties, args.db)
    else:
        db_path = args.db.replace("sqlite:///", "")
        ingest_sqlite3(states, counties, db_path)

    print("\n[ingest] ✓ Complete.")
    print("  Tables ready : states, counties")
    print("  rae_url field: Census RAE report link per county — use as source of truth")
    print("  Next step    : point the Replit agent at rae_url for per-county data lookups")


if __name__ == "__main__":
    main()

Expected output

[ingest] Reading united-states.xlsx …
[ingest] Parsed  51 states  +  3143 counties
[ingest] SQLAlchemy  →  sqlite:///instance/app.db
[ingest] States  — inserted: 51  skipped: 0
[ingest]   … 500 counties committed
[ingest]   … 1000 counties committed
[ingest]   … 1500 counties committed
[ingest]   … 2000 counties committed
[ingest]   … 2500 counties committed
[ingest]   … 3000 counties committed
[ingest] Counties — inserted: 3143  skipped: 0

[ingest] ✓ Complete.
  Tables ready : states, counties
  rae_url field: Census RAE report link per county — use as source of truth
  Next step    : point the Replit agent at rae_url for per-county data lookups

Hey Jeff,

Your ingest_counties.py script + the official united-states.xlsx (with FIPS + rae_url) is honestly a great foundation. It gets all 3,143 counties into SQLite cleanly and idempotently without overwhelming the Replit Agent. That’s exactly the kind of modular, static scaffolding we need to avoid the “agent hates large datasets” problem.

Recommended Hybrid Approach (Best for Your Goals)

Keep the current ingest for the county list and rae_url links (those act as your reliable “source of truth” dashboards that update on Census’s side).

Then layer in the Census Data API for the actual structured data your jurisdiction resolver and agent will need (population, income, self-employment rates, unemployment, etc.). This way:

  • The agent never has to scrape or hold massive datasets in context.
  • You query your local counties table first for the FIPS.
  • Then hit the API on-demand (or via a scheduled refresh job) for fresh ACS numbers.
  • Cache the results in your DB so lookups stay fast and you stay well under rate limits.

This directly solves the scaling issue you hit with Ohio/Texas — the agent stays surgical, and in 2026+ a simple cron/download + ingest + refresh script keeps everything current without re-uploading anything manually.

Quick example helper you could add:

Python

def get_acs_data(state_fips, county_fips, variables=None, year=2024):
    variables = variables or ["B01003_001E", "B19013_001E"]  # avoid a mutable default arg
    url = (f"https://api.census.gov/data/{year}/acs/acs5"
           f"?get=NAME,{','.join(variables)}&for=county:{county_fips}"
           f"&in=state:{state_fips}&key=YOUR_KEY")
    # fetch, parse, cache in DB, fall back to rae_url if needed

Get a free API key (highly recommended) — without it you’re capped at ~500 queries/day per IP. With a key you can do a lot more, especially since you’ll cache aggressively.
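A minimal sketch of that cache-first pattern, assuming a hypothetical `acs_cache` table and a 30-day freshness window (the table name, columns, and window are my own inventions, not part of your script):

```python
import json
import sqlite3
import time
import urllib.request

def get_cached_acs(con, state_fips, county_fips, year=2023, max_age_days=30):
    """Return cached ACS JSON if fresh; otherwise fetch from the API and cache it."""
    con.execute("""CREATE TABLE IF NOT EXISTS acs_cache (
        fips_full TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)""")
    fips_full = state_fips + county_fips
    row = con.execute("SELECT payload, fetched_at FROM acs_cache WHERE fips_full = ?",
                      (fips_full,)).fetchone()
    if row and time.time() - row[1] < max_age_days * 86400:
        return json.loads(row[0])  # fresh cache hit: no API call, no rate-limit cost
    url = (f"https://api.census.gov/data/{year}/acs/acs5"
           f"?get=NAME,B01003_001E&for=county:{county_fips}&in=state:{state_fips}")
    data = json.loads(urllib.request.urlopen(url, timeout=10).read())
    con.execute("INSERT OR REPLACE INTO acs_cache VALUES (?, ?, ?)",
                (fips_full, json.dumps(data), time.time()))
    con.commit()
    return data
```

On a cache hit the function never touches the network, which is exactly what keeps you under the unkeyed ~500 queries/day cap even with frequent lookups.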

For the jurisdiction wizard, your agent can now do something clean like:

  • Look up county by ZIP or name in your local table
  • Pull cached API stats or live rae_url
  • No more trying to stuff 50 states of scraped tax rules into one prompt
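That "look up county locally first" step can be nothing more than a parameterized query against the table the ingest script already built. A sketch (the `find_county` helper and demo rows are illustrative, not part of the script; ZIP lookup would additionally need a ZIP-to-county crosswalk table, which isn't shown):

```python
import sqlite3

def find_county(con, name_query, state_abbr=None):
    """Resolve a county by (partial) name against the local counties table."""
    sql = "SELECT fips_full, full_name, rae_url FROM counties WHERE name LIKE ?"
    params = [f"%{name_query}%"]
    if state_abbr:
        sql += " AND state_abbr = ?"
        params.append(state_abbr)
    return con.execute(sql, params).fetchall()

# Demo against an in-memory copy of the schema
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE counties (fips_full TEXT, fips_state TEXT,
    fips_county TEXT, name TEXT, full_name TEXT, state_abbr TEXT,
    state_name TEXT, rae_url TEXT)""")
con.execute("INSERT INTO counties VALUES ('39035', '39', '035', 'Cuyahoga',"
            " 'Cuyahoga County, Ohio', 'OH', 'Ohio', NULL)")
print(find_county(con, "Cuyah", "OH"))  # → [('39035', 'Cuyahoga County, Ohio', None)]
```

The FIPS codes this returns are what you'd then feed into the ACS API call, so the agent never sees more than one county's worth of data at a time.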

This also plays nice with your self-hosting plans later (SQLite → Postgres on a cheap VPS) and keeps sensitive family/gig data off Replit.


What’s surprising is that my agent never pursued this idea.

This brings me back to my point about the agent’s (and our collective) ignorance (yes, mine as well).

If we’re all circling a problem like hawks but unable to strike it down, the agent won’t magically solve it for us. We need to lead it into different territories. I asked for suggestions on this as well.

The problem was that the context was too large, and the Replit agent was consuming too much data. I also wasn’t able to articulate what I needed to say to close the agent’s (our) knowledge gap. Sending more prompts and spending more computation/cycles would not have helped here.

This is what leads someone into dangerous technical debt over time. Which brings me back to a hybrid approach with several LLMs and several humans. That makes the most sense, and it’s the safest way to build with agents at scale. Tunnel vision, biases, arrogance, and ignorance lead people astray.

They can make a mere man feel like God, when in fact he is just a simple man.

@jefftj86

@mikeamancuso

Suggestions for hosting platforms?

I’m considering whether it’s the right move to switch entirely to GCP for this project, given that Replit already uses GCP as their deployment method.

It’s AWS, Azure, or GCP.

My best advice is to perform regular retrospectives with your agent.

Here is a simple one to get started:

Let’s do a quick retro. Did you observe any opportunities for us to improve our code base or discover any deficiencies? Are there any loose ends from our work? Pre-existing bugs? Any brittle code or patterns that would struggle with production data? Did we do anything that would make it difficult for future parallel agents to swarm? What can I be doing better to communicate with you? Knowing what we know now, would you have done anything differently?

As for hosting, Railway is a nice and simple way to get started.

LPT: Give Replit the API key for your host and it will configure it for you.
I have done this with AWS, Railway, Supabase, Azure, etc.

Hey, thanks for the feedback!

I’ve tried and tested several rounds of my own retro communication, as you mentioned. I’ve noticed that the agent gets accustomed to your communication style, so I haven’t used that exact prompt. However, I assure you that I did everything you suggested.

The real solution was what @jefftj86 suggested.

My argument is that unless I know what to say to the agent to overcome this knowledge gap, I’ll be stuck in a loop. In fact, I spent nearly $10-$15 doing just that.

Then, a human (@jefftj86) came in and immediately noticed the logical gaps between my communication with the agent and the agent’s natural limitations. This pointed me to a different approach.

If you’ve ever hit a roadblock and then watched an outside pair of eyes solve the problem right away, you’ll understand the phenomenon I’m referring to here.

I’ll save your suggested prompt as well. Perhaps using language that the agent isn’t accustomed to (with me) will “shake things up” in a productive way.

As for the hosting, for context:

I’m actively pursuing GCP, AWS, and Azure certifications while I’m out of formal university classes. This lets me deepen my understanding of the concepts in a lower-risk setting before re-engaging with formal academia.

So, for me, I’m not looking for the cheapest AI-generated answer. Instead, I prefer a combination of all the options, which I then “cherry-pick” according to the project and what I feel meets the project’s needs.

I was really asking what you thought would be the best infrastructure for my objective: small-scale, hosted individually on users’ own accounts (think open-sourced/remixed, with users migrating quickly out of Replit to their new self-hosted setup).

I suppose Railway might be a solution for that niche. I could set it up so users self-host with the option you mentioned…

However, I genuinely appreciate and am saving your feedback. It’s valuable.

In fact, I’m beginning to think that having humans with their agents, who use their time to communicate with their LLMs, might produce different outcomes. That alone could be useful to share. I’m not sure if I articulated that correctly, but…

Thanks a bunch