Why Taxpayers Shouldn't Rely Exclusively On AI

As the April 15 filing deadline approaches, many taxpayers will be tempted to outsource one of the year’s most hectic chores to a chatbot. It’s an easy impulse to understand. Large language models have become remarkably effective at translating dense text into usable explanations. They are already transforming how scientists conduct research and how software engineers write code. But taxes punish confident-sounding errors: a graceful explanation is not the same thing as a correct return.

The right question is not whether AI can help during tax season. It can—and it should be used. The more important question is where it belongs in the workflow, given the current state of the art. Large language models (LLMs) are optimized for text—and increasingly other modalities such as images, video, and audio—excelling at finding, summarizing, and explaining information. In tax settings, when used carefully, they can be useful for navigating IRS publications, surfacing relevant provisions, translating technical language into plain English, or helping taxpayers prepare informed questions for a professional.

Those strengths don’t extend to full reliability. Even when provided with the relevant legal materials and structured prompts, these systems still fall short of expert tax professionals. One study found that even the top-performing model got nearly half of open-ended tax questions wrong when the task required precise calculation and correct application of tax rules.

The deeper problem is that filing a tax return is not just a language task. It is also a calculation and eligibility-determination task. In TaxCalcBench’s July 2025 paper, frontier models correctly computed fewer than one-third of returns in the benchmark’s federal-only test set. The benchmark isolates only the calculation stage: it uses federal-only returns and assumes that document collection and preparation have already been completed correctly, which likely makes the task easier than full end-to-end filing. The recurring problems were not subtle: models used the wrong tax tables, made calculation errors, and incorrectly determined eligibility (Bock et al., 2025).

Traditional tax software relies on deterministic tax engines: if the same inputs go in, the same output comes out, every time. Large language models do not work that way; they are probabilistic machines. A recent paper on LLMs used for data fitting found that even task-irrelevant changes—renaming variables, reordering columns, changing row order, or altering the format of the same underlying data—could materially change model predictions. In some settings, prediction sensitivity swung by as much as 82 percent. Importantly, the authors show this is not just a matter of randomness: even with tightly controlled decoding, the systems remained brittle to irrelevant changes in representation.
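For readers who want the contrast made concrete: a deterministic tax engine is, at bottom, a pure function. The brackets and rates below are invented for illustration only (they are not actual IRS tax tables), but the point holds for any such engine: the same input always yields the same output, which is exactly the guarantee a sampled model response cannot make.

```python
# Illustrative sketch of a deterministic "tax engine" as a pure function.
# The thresholds and rates are made up for demonstration, not real IRS tables.

def tax_owed(taxable_income: float) -> float:
    """Compute tax from fixed, hypothetical brackets: same input, same output."""
    brackets = [(0, 0.10), (10_000, 0.20), (50_000, 0.30)]  # (threshold, rate)
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if taxable_income > lower:
            tax += (min(taxable_income, upper) - lower) * rate
    return round(tax, 2)

# Deterministic: repeated calls with the same input agree exactly.
print(tax_owed(60_000))                    # 12000.0 every time
print(tax_owed(60_000) == tax_owed(60_000))  # True
```

Nothing about the input's formatting—column order, field names, table versus JSON—can change this function's answer, which is the property the brittleness research shows LLMs lack.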

In a filing context, that kind of brittleness is a dealbreaker. If the answer changes because a column moved, a W-2 field showed up in a different order, or the same facts got reformatted from a table into JSON, you can't trust the output for a final return. Tax law doesn't do "close enough," and the IRS doesn't grade on style points.

There is a second, subtler problem. Tax outcomes depend not only on calculation but on issue spotting, which requires context that lives outside the documents—changes in filing status, new ventures, carryforwards, and the like—and that only the taxpayer knows. LLMs are largely reactive and struggle with context unless it is fully specified. They answer the question they are asked, not the one the taxpayer did not know to ask.

If a filer does not realize they may qualify for a credit, deduction, exclusion, carryforward, or state-specific adjustment, an LLM may never surface it. The risk is not only over-claiming a benefit but also underclaiming one. That loss can occur quietly, without anyone realizing it.

Evidence from legal AI is cautionary here as well. Even retrieval-based systems exhibit nontrivial hallucination rates, and models often fail to correct users’ false assumptions.

The IRS has been explicit: “taxpayers should not rely on AI-generated responses to complex tax questions and should verify any calculations or information provided by artificial intelligence”. The Taxpayer Advocate Service has likewise warned against sole reliance on AI-generated tax advice. And even if a paid preparer signs the return, the IRS says the taxpayer is ultimately accountable for the accuracy of every item reported.

In other words, when the model is wrong, the model does not pay. You do.

There is also the privacy issue. Tax returns contain highly sensitive information—Social Security numbers, income, dependents, and financial records. NIST warns that generative AI systems may leak or infer such data from disparate sources. Not all chatbots are designed for tax confidentiality, and many users do not understand how their data may be stored or used.

What, then, should AI be used for during tax season? Its strength lies in language-intensive tasks—mapping your situation to IRS forms and guidance, translating dense instructions, clarifying questions, and generating a checklist before filing. It can also help compare how rules are described across sources to surface potential confusion. But these strengths are not authority. A general-purpose LLM is best used as a research and preparation tool—not as a substitute for tax software, a qualified professional, or your own careful review.

The lesson here goes beyond April 15. LLMs can be genuinely powerful tools when used for the right tasks in the right context. But sounding authoritative isn't the same as being correct, and being helpful during research doesn't make a tool dependable at execution time. In tax filing, that gap isn't theoretical—it's financial, legal, and personal.

Dokyun (DK) Lee is a Kelli Questrom Chair Associate Professor of Information Systems and Computing & Data Sciences at Boston University. His research examines the development, deployment, and impact of artificial intelligence in business and society, with particular emphasis on generative AI, large language models, and unstructured data.

