Billions of dollars flow through the internet each day, and behind nearly every transaction is a business trying to get paid. That could be a contractor, a Fortune 500 company, or even your friend Jerry Venmo-requesting you $2.95 for his portion of a 6-way Uber. But no matter who you are, getting paid usually means logging into a dashboard, clicking through menus, and filling out the same form you’ve filled out a hundred times. That led me to wonder: what if a voice command were all it took for money to change hands? One thing led to another, and I ended up building a voice agent that sends Stripe invoices with nothing but a phone call.
Under the hood, moving money isn’t simple. Even the smallest payments kick off a chain reaction involving banks, networks, and settlement systems. Here’s what that actually looks like (Very Simplified).
Here are some important definitions (taken from the Stripe documentation):
Cardholder: The cardholder is the individual who owns the credit card and uses it to make purchases for goods or services.
Merchant: The merchant is the business or service provider that accepts credit card payments from customers in exchange for goods or services.
Issuing Bank: The issuing bank, also called the “issuer” or “card issuer,” is the financial institution that issues the credit card to the cardholder. It authorizes and approves transactions, and it provides the funds for the purchase.
Acquiring Bank: The acquiring bank, also known as the “acquirer” or “merchant bank,” is the financial institution that has a contractual relationship with the business to accept and process credit card transactions. It settles funds with the issuing bank and deposits the funds into the business’s account.
You might be wondering: where are Visa and Mastercard in all this? Aren’t they the multibillion-dollar giants at the center of global payments? They are, but rather than being one of the entities above, they are the arrows between them: the card networks carry authorization and settlement messages between the acquiring bank and the issuing bank, acting as “financial rails-as-a-service”.
Now that there’s a bit of context, it becomes apparent why a company like Stripe had to come along: someone needed to abstract the global payments infrastructure into something that developers could use. To handle this complexity cleanly, Stripe introduced the PaymentIntent. This is Stripe’s way of making sure money movement happens reliably, securely, and with full visibility across every layer of the transaction. In other words, it tracks the full lifecycle of a payment from start to finish, including authorization, authentication, capture, failure, and settlement.
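To make that lifecycle concrete, here is a minimal sketch that models a PaymentIntent as a small state machine. The status names (requires_payment_method, requires_confirmation, requires_action, processing, requires_capture, succeeded, canceled) are the ones Stripe documents; the transition table, the canTransition and advance helpers, and the pi_example object are a simplification for illustration, not Stripe's implementation.

```javascript
// Simplified model of PaymentIntent statuses. The status names are Stripe's;
// the allowed transitions below are an illustrative approximation.
const TRANSITIONS = {
  requires_payment_method: ['requires_confirmation', 'canceled'],
  requires_confirmation: ['requires_action', 'processing', 'canceled'],
  requires_action: ['processing', 'requires_payment_method', 'canceled'],
  processing: ['requires_capture', 'succeeded', 'requires_payment_method'],
  requires_capture: ['succeeded', 'canceled'],
  succeeded: [],
  canceled: [],
};

// True when the model allows moving from one status to another.
function canTransition(from, to) {
  const allowed = TRANSITIONS[from];
  return Array.isArray(allowed) && allowed.includes(to);
}

// Returns a copy of the intent in the next status, or throws on an illegal move.
function advance(intent, nextStatus) {
  if (!canTransition(intent.status, nextStatus)) {
    throw new Error(`Invalid transition: ${intent.status} -> ${nextStatus}`);
  }
  return { ...intent, status: nextStatus };
}

// Happy path for a card payment with manual capture (hypothetical intent object).
let intent = { id: 'pi_example', amount: 295, currency: 'usd', status: 'requires_payment_method' };
for (const status of ['requires_confirmation', 'processing', 'requires_capture', 'succeeded']) {
  intent = advance(intent, status);
}
console.log(intent.status);
```

The point of the model is that a payment is never just "done" or "not done": each status tells your code what has to happen next, and terminal statuses like succeeded and canceled have no way out.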
As you can see in the diagram, the backend middleman gets cut out partway through the flow. This is intentional: card details are sent directly from the client to Stripe, bypassing your server entirely in order to stay compliant with PCI requirements. Overall though, Stripe reduces the entire financial system to developer-native primitives (just like Twilio for telephony).
Stripe exposes a wide range of APIs, from Invoicing and Subscriptions to Checkout and Terminal. No matter which API you hit, you’re eventually driving the same lifecycle: authorize, capture, settle. In the hackathon project, I specifically hit the Invoicing API.
Beyond the API Call: MCP and the Future of Agent Actions
I spent a lot of time digging into how MCP (Model Context Protocol) servers, often just called “MCPs”, differ from plain APIs. What exactly is the difference between an API and an MCP? Why can’t we just keep using HTTP calls and structured schemas? What’s the real value-add?
At a foundational level, MCPs and APIs aren’t so different. An MCP is really just a thin layer on top of an API with one critical addition: a discovery mechanism. Instead of hardcoding tool calls, agents can ask the server what it offers (in the MCP specification this is the tools/list request), get back a list of available actions along with their input schemas, and compose these actions to fulfill a request. This implies that the burden of utilizing MCPs fully falls on the client side (Anthropic has explicitly stated this). The dynamic function invocation has to be fully implemented by the client, which is why people tend to plug their MCPs into popular clients like Cursor and Claude. This is actually very similar to OpenAI’s function calling (tool calling) framework in the sense that tools are agentically called based on context. The difference is that OpenAI’s function calling framework is proprietary and tied to a thread/conversation.
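Here is a rough sketch of that discover-then-call loop. MCP messages are JSON-RPC 2.0, and tools/list and tools/call are the method names from the MCP specification; everything else is an assumption for illustration: the server is an in-memory function rather than a real stdio or HTTP transport, and the create_invoice tool, its schema, and its handler are hypothetical.

```javascript
// In-memory stand-in for an MCP server. The tool below is hypothetical.
const tools = {
  create_invoice: {
    description: 'Create an invoice for a customer',
    inputSchema: {
      type: 'object',
      properties: {
        email: { type: 'string' },
        amount: { type: 'integer', description: 'Amount in cents' },
      },
      required: ['email', 'amount'],
    },
    handler: ({ email, amount }) => `Invoice for ${amount} cents queued for ${email}`,
  },
};

// Handles one JSON-RPC request and returns the JSON-RPC response object.
function handleRequest(request) {
  const { id, method, params = {} } = request;
  if (method === 'tools/list') {
    const list = Object.entries(tools).map(([name, tool]) => ({
      name,
      description: tool.description,
      inputSchema: tool.inputSchema,
    }));
    return { jsonrpc: '2.0', id, result: { tools: list } };
  }
  if (method === 'tools/call') {
    const known = Object.prototype.hasOwnProperty.call(tools, params.name);
    if (!known) {
      return { jsonrpc: '2.0', id, error: { code: -32602, message: `Unknown tool: ${params.name}` } };
    }
    const text = tools[params.name].handler(params.arguments);
    return { jsonrpc: '2.0', id, result: { content: [{ type: 'text', text }], isError: false } };
  }
  return { jsonrpc: '2.0', id, error: { code: -32601, message: 'Method not found' } };
}

// Client side: discover what exists first, then call a tool by its advertised name.
const listed = handleRequest({ jsonrpc: '2.0', id: 1, method: 'tools/list' });
const toolName = listed.result.tools[0].name;
const called = handleRequest({
  jsonrpc: '2.0',
  id: 2,
  method: 'tools/call',
  params: { name: toolName, arguments: { email: 'jerry@example.com', amount: 295 } },
});
console.log(called.result.content[0].text);
```

Notice that the client never hardcodes create_invoice. In a real agent, the model would read the names, descriptions, and schemas returned by tools/list and decide which tool to call, which is exactly the part the client has to implement.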
Vapi and Stripe both have MCP servers. Within the Cursor client, you can make phone calls when the Vapi MCP is enabled. Stripe has also put out an MCP, and to motivate how the Stripe MCP could be helpful, I’ll show you the functions I wired into my Vapi phone agent to generate the invoice.
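// Assumes the official Stripe Node library and a secret key in STRIPE_SECRET_KEY;
// `event` holds the name, email, amount, and description collected on the call.
const stripe = require('stripe')(process.env.STRIPE_SECRET_KEY);

// Step 1: Create customer
const customer = await stripe.customers.create({
  name: event.name,
  email: event.email
});

// Step 2: Create a pending invoice item for that customer
await stripe.invoiceItems.create({
  customer: customer.id,
  amount: parseInt(event.amount, 10), // amount in cents
  currency: 'usd',
  description: event.description || 'Invoice from voice agent'
});

// Step 3: Create the invoice; auto_advance lets Stripe finalize it automatically
const invoice = await stripe.invoices.create({
  customer: customer.id,
  auto_advance: true,
  pending_invoice_items_behavior: 'include'
});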
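As a hedged variation on the three steps above, the sketch below wraps them in one function and makes two changes that are my assumptions, not what the project necessarily did. First, it validates the amount strictly, since parseInt would quietly turn a transcribed "2.95" into 2 cents. Second, it creates the invoice with collection_method: 'send_invoice' and then calls finalizeInvoice and sendInvoice explicitly, so the invoice is emailed right away instead of waiting on auto_advance. The Stripe client is passed in as a parameter, and the makeFakeStripe stub, the 30-day due date, and the sample event are hypothetical so the example runs without an API key.

```javascript
// Accept only whole, positive cent amounts such as "295".
function parseCents(value) {
  const text = String(value).trim();
  if (!/^\d+$/.test(text)) {
    throw new Error(`Amount must be a whole number of cents, got: ${value}`);
  }
  const cents = Number(text);
  if (cents <= 0) {
    throw new Error('Amount must be greater than zero');
  }
  return cents;
}

// Runs the customer -> invoice item -> invoice flow, then finalizes and sends it.
async function createAndSendInvoice(stripe, event) {
  const amount = parseCents(event.amount); // validate before touching Stripe
  const customer = await stripe.customers.create({ name: event.name, email: event.email });
  await stripe.invoiceItems.create({
    customer: customer.id,
    amount,
    currency: 'usd',
    description: event.description || 'Invoice from voice agent',
  });
  const invoice = await stripe.invoices.create({
    customer: customer.id,
    collection_method: 'send_invoice',
    days_until_due: 30, // assumed payment terms
    pending_invoice_items_behavior: 'include',
  });
  const finalized = await stripe.invoices.finalizeInvoice(invoice.id);
  return stripe.invoices.sendInvoice(finalized.id);
}

// Hypothetical stub that records calls instead of hitting the Stripe API.
function makeFakeStripe(calls) {
  const record = (name, result) => async (arg) => {
    calls.push([name, arg]);
    return result;
  };
  return {
    customers: { create: record('customers.create', { id: 'cus_test' }) },
    invoiceItems: { create: record('invoiceItems.create', { id: 'ii_test' }) },
    invoices: {
      create: record('invoices.create', { id: 'in_test', status: 'draft' }),
      finalizeInvoice: record('invoices.finalizeInvoice', { id: 'in_test', status: 'open' }),
      sendInvoice: record('invoices.sendInvoice', { id: 'in_test', status: 'open' }),
    },
  };
}

const calls = [];
const demo = createAndSendInvoice(makeFakeStripe(calls), {
  name: 'Jerry',
  email: 'jerry@example.com',
  amount: '295',
});
demo.then((invoice) => {
  console.log(invoice.id, calls.map(([name]) => name).join(' -> '));
});
```

In the real agent you would pass the actual Stripe client in place of the stub. Injecting the client like this also makes the flow easy to test without creating live customers.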
Theoretically, if I used the Stripe MCP, Vapi should be able to intelligently compose these functions on its own when I say “Create an invoice”.
Overall, I had a lot of fun doing this project. What started as a voice demo quickly pulled me into the underlying systems that power modern payments and voice interfaces. Vapi made the whole thing work seamlessly. And MCP hints at how agents might use these tools autonomously in the future. In terms of the actual project, it’s pretty amazing that we can leverage voice interfaces to make our lives easier and more efficient.