Use ChatGPT to extract transactions CSV from bank statement PDFs
In this article, I share my experience of extracting transaction data from bank statement PDFs and converting it into CSV. I use a combination of command line tools and ChatGPT to process the semi-structured text.
Context
I have a habit of tracking my own expenses: I keep a large Google Sheet that lists all my expenses for the month, and I build charts from it. While some Canadian banks offer CSV downloads, others (e.g., BMO) only generate statements in PDF format. Naturally, I want to extract the transactions from those PDFs.
I certainly don't want to enter the expenses manually. I want to automate this tedious task.
My solution
tl;dr: ps2ascii + ChatGPT (with a simple prompt).
Explanation: I use ps2ascii to extract the raw text from the bank statement PDF, copy that text into a ChatGPT prompt, and ask ChatGPT to extract the transactions and format them as CSV.
Here are the detailed steps.
Install ps2ascii (on macOS; it ships with Ghostscript):
brew install ghostscript
Feed the PDF path to the command:
ps2ascii jan-28.pdf
You will see that it extracts all the text from the PDF. Copy the text output, and then write a ChatGPT prompt like:
Extract the transactions from following text and format them into csv format, order by date.
""" <paste your copied text here> """
With this prompt, I was able to get ChatGPT to output the transactions reliably enough.
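The two manual steps above (run ps2ascii, then paste the text into the prompt) could also be scripted. What follows is only a minimal Python sketch, assuming ps2ascii is on the PATH and prints the extracted text to standard output as shown earlier. The names extract_text, build_prompt and PROMPT_TEMPLATE are hypothetical helpers invented for this illustration; the prompt wording is copied from above.

```python
import subprocess

# Prompt wording copied from the article; the template name is hypothetical.
PROMPT_TEMPLATE = (
    "Extract the transactions from following text and format them "
    "into csv format, order by date.\n"
    '""" {statement_text} """'
)


def extract_text(pdf_path, runner=subprocess.run):
    """Return whatever ps2ascii prints to stdout for the given PDF.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without Ghostscript installed.
    """
    result = runner(
        ["ps2ascii", pdf_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if ps2ascii exits with an error
    )
    return result.stdout


def build_prompt(statement_text):
    """Wrap the extracted statement text in the prompt from the article."""
    return PROMPT_TEMPLATE.format(statement_text=statement_text.strip())
```

Calling build_prompt(extract_text("jan-28.pdf")) would give a string ready to paste into ChatGPT. Sending it to the model is deliberately left out of the sketch.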
Fine-tuning the prompt
You can further tune the output format by giving examples. For example, my bank statements sometimes mark credit amounts with 'CR' after the digits, and I don't want that in my CSV. I prefer plain positive or negative numbers, so I changed the prompt to:
Extract the transactions from following text and format them into csv format, order by date. For the CSV output, convert the Amount column based on the following rule: If the amount ends with a CR, transform it as a positive number, e.g., "160.38 CR" will be transformed into "+160.38", otherwise remain unchanged. Also, I don't need the Posting date column:
""" <paste your copied text here> """
Note: This is the only prompt tuning I needed to do to get the output I wanted. ChatGPT is awesome.
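Since the model's output is only "reliable enough", it can help to spot-check the CR rule locally. The sketch below is one possible way to do that in Python and is not part of the workflow described above. It assumes the reply is plain CSV with a header row containing an "Amount" column (the column name used in the prompt); the function names and the sample rows are made up for illustration.

```python
import csv
import io
import re

# Matches amounts such as "160.38 CR" or "1,234.56CR".
CR_AMOUNT = re.compile(r"^\s*([\d,]+\.\d{2})\s*CR\s*$")


def normalize_amount(raw):
    """Apply the rule from the prompt: "160.38 CR" becomes "+160.38".

    Any other value is returned as-is, minus surrounding whitespace.
    """
    match = CR_AMOUNT.match(raw)
    if match:
        return "+" + match.group(1)
    return raw.strip()


def rows_with_leftover_cr(csv_text, amount_column="Amount"):
    """Return the CSV rows whose amount still ends with "CR".

    An empty list means the model applied the rule to every row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row[amount_column].strip().upper().endswith("CR")
    ]


# Made-up sample reply in which one row was not converted.
sample_reply = (
    "Date,Description,Amount\n"
    "Jan 02,COFFEE,4.50\n"
    "Jan 05,REFUND,160.38 CR\n"
)
for row in rows_with_leftover_cr(sample_reply):
    print(row["Description"], "->", normalize_amount(row["Amount"]))
```

Any row this reports can be fixed with normalize_amount, or the statement can be sent to ChatGPT again.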
What's next
From this tiny example, you can see how I use a combo of command line tools and ChatGPT to extract valuable data from PDFs for my own purposes. The solution builds on the simplicity and reliability of a single command line utility for extracting raw text from PDFs, and on the power of ChatGPT for natural language processing.
I'm happy with the results I got. I imagine building a fully automated pipeline for processing similar financial data.
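As a starting point for such a pipeline, here is a rough Python skeleton that loops over a folder of statement PDFs and writes one CSV per statement. It is a sketch under stated assumptions and not something I have built. Both steps are passed in as functions: extract_text stands for the ps2ascii step, and ask_model stands for whatever sends the prompt to ChatGPT and returns the CSV text. Both names, and process_statements itself, are hypothetical.

```python
from pathlib import Path


def process_statements(pdf_dir, out_dir, extract_text, ask_model):
    """Convert every *.pdf in pdf_dir into a CSV file in out_dir.

    extract_text(pdf_path: str) -> str  # hypothetical: wraps ps2ascii
    ask_model(statement_text: str) -> str  # hypothetical: returns CSV text
    Returns the list of CSV paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order stable between runs.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        statement_text = extract_text(str(pdf_path))
        csv_text = ask_model(statement_text)
        csv_path = out_dir / (pdf_path.stem + ".csv")
        csv_path.write_text(csv_text, encoding="utf-8")
        written.append(csv_path)
    return written
```

Passing the two steps in as arguments keeps the loop independent of any particular model API, and makes it possible to try the skeleton with stand-in functions first.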