Communication & Messaging (advanced)
September 18, 2025
60 minutes
Personal AI Assistant on WhatsApp – Handle Text, Audio, Images & PDFs Effortlessly
Build an AI-powered WhatsApp chatbot with n8n. Handle text, voice, images, and PDFs automatically using OpenAI, Google Gemini, and WhatsApp API.
By Kazi Sakib
Required Tools
n8n, WhatsApp, OpenAI

WhatsApp has become the go-to communication platform for billions of users worldwide, but managing multiple types of content can be overwhelming. Whether you're dealing with voice messages you can't listen to right now, images that need analysis, or PDF documents requiring quick summaries, manually processing everything takes time and effort.
This powerful n8n workflow solves these challenges by creating an intelligent WhatsApp chatbot that automatically processes and responds to four different content types. Here's how it transforms your messaging experience:
- Smart Text Processing: Handles regular conversations with context-aware responses using advanced AI
- Voice-to-Voice Intelligence: Transcribes voice messages and responds back with audio, maintaining the natural flow of voice conversations
- Image Analysis & Description: Analyzes images and provides detailed descriptions, specifications, or answers questions about visual content
- PDF Document Processing: Extracts and summarizes content from PDF files, making document analysis effortless
Prerequisites: APIs and Services You'll Need
Before diving into the workflow construction, gather these essential API keys and services:
- WhatsApp Business API: For sending and receiving messages (requires Facebook Business verification)
- OpenAI API: Powers the main chat model (GPT-4o mini), image analysis, and text-to-speech generation
- Google Gemini API: Handles transcription of incoming voice messages
- n8n Instance: Self-hosted or cloud version to run the automation workflow
Key Components: The Building Blocks
This workflow leverages several powerful n8n nodes to create a seamless experience:
- WhatsApp Trigger & WhatsApp Nodes: Handle incoming messages and send responses
- Switch Node: Intelligently routes different message types to appropriate processing paths
- HTTP Request Nodes: Download media files from WhatsApp servers
- OpenAI Nodes: Process text, analyze images, and generate speech
- Google Gemini Node: Transcribe audio with high accuracy
- AI Agent & Memory Nodes: Maintain conversation context and provide intelligent responses
- Extract from File Node: Pull text content from PDF documents
Step 1: Set Up Message Reception and Routing
The workflow begins with a WhatsApp Trigger node that captures all incoming messages. This trigger connects to a Switch node that acts as the traffic controller, examining each message to determine its type.

The Switch node checks for four specific properties in the incoming message:
- Text messages: Looks for messages[0].text.body
- Voice messages: Detects messages[0].audio object
- Images: Identifies messages[0].image object
- Documents: Searches for messages[0].document object
Any message type not matching these criteria gets routed to a "Not supported" response, keeping your chatbot focused and user-friendly.
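The routing rule above can be sketched outside n8n as a small function. This is only an illustration of the logic, not the Switch node's actual configuration: the function name `classify_message` is hypothetical, and it assumes the input is the object that holds the `messages` array described above.

```python
from typing import Any


def classify_message(value: dict[str, Any]) -> str:
    """Return 'text', 'audio', 'image', 'document', or 'unsupported'.

    Mirrors the four checks listed above, in the same order.
    """
    messages = value.get("messages") or []
    if not messages:
        return "unsupported"

    message = messages[0]

    # Text messages: messages[0].text.body must be present and non-empty.
    text = message.get("text")
    if isinstance(text, dict) and text.get("body"):
        return "text"

    # Media messages: the presence of the object decides the route.
    for kind in ("audio", "image", "document"):
        if isinstance(message.get(kind), dict):
            return kind

    # Everything else goes to the "Not supported" reply.
    return "unsupported"
```

For example, `classify_message({"messages": [{"audio": {"id": "123"}}]})` returns `"audio"`, while a sticker or location message falls through to `"unsupported"`.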
Step 2: Process Media Content with Specialized Handlers
Each content type follows a tailored processing path designed for optimal results.
Voice Message Processing: Voice messages first go through a URL retrieval step using the WhatsApp API, then get downloaded via HTTP request. The audio file is sent to Google Gemini for transcription, which excels at understanding various accents and background noise.
Image Analysis Pipeline: Images follow a similar download process but are analyzed by OpenAI's GPT-4o mini with a detailed system prompt. This prompt instructs the model to describe subjects, colors, lighting, any visible text, and contextual information.
PDF Document Handling: Documents go through validation first, ensuring only PDF files are processed. The workflow rejects other formats with a helpful error message. Valid PDFs get downloaded and processed through the Extract from File node, which pulls out readable text content.
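The PDF validation step can be pictured as a simple gate on the document's MIME type. The sketch below assumes the incoming document object carries a `mime_type` field and an `id`; the function names and the wording of the rejection message are hypothetical and are not taken from the workflow.

```python
def is_pdf(document: dict) -> bool:
    """True when the document's MIME type is application/pdf."""
    # Drop any parameters such as "; charset=..." before comparing.
    mime = str(document.get("mime_type", "")).split(";")[0].strip().lower()
    return mime == "application/pdf"


def pdf_gate(message: dict) -> tuple[str, str]:
    """Decide whether a document message continues to text extraction.

    Returns ("process", media_id) for PDFs, otherwise
    ("reject", user_facing_message).
    """
    document = message.get("document") or {}
    if is_pdf(document):
        return ("process", str(document.get("id", "")))
    # Hypothetical wording; the real workflow defines its own reply text.
    return ("reject", "Sorry, only PDF files are supported right now.")
```

Checking the MIME type rather than the file name avoids accepting a renamed file, although either check could be used depending on how strict the gate needs to be.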
Step 3: Create Intelligent AI Processing Engine
All processed content flows into the AI Agent, the workflow's brain. This agent uses OpenAI's GPT-4o mini model with a system prompt that defines its personality and capabilities:

"You are an intelligent assistant. Your purpose is to analyze various types of input and provide helpful, accurate responses. Process and respond to text messages, analyze uploaded files, interpret and describe images, and transcribe and understand voice messages."
The AI Agent connects to a Simple Memory node that maintains conversation context using unique session keys for each user. This memory system stores the last 10 messages, ensuring responses feel natural and contextually relevant.
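To make the idea of a per-user, 10-message window concrete, here is a minimal sketch of that behavior. It does not reproduce how n8n's Simple Memory node is implemented; the class name `WindowMemory` and the use of a phone number as the session key are assumptions made for illustration.

```python
from collections import defaultdict, deque

WINDOW = 10  # number of most recent messages kept per session


class WindowMemory:
    """Keeps the last `window` messages separately for each session key."""

    def __init__(self, window: int = WINDOW) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._sessions = defaultdict(lambda: deque(maxlen=window))

    def add(self, session_key: str, role: str, content: str) -> None:
        self._sessions[session_key].append({"role": role, "content": content})

    def history(self, session_key: str) -> list[dict]:
        return list(self._sessions[session_key])


memory = WindowMemory()
for i in range(12):
    memory.add("15550001111", "user", f"message {i}")

# Only the 10 most recent messages remain for this user.
recent = memory.history("15550001111")
```

Because each session key has its own buffer, one user's conversation never leaks into another's, which is what allows a single workflow to serve several users at once.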
Step 4: Implement Smart Response Generation
The workflow's most clever feature is its response matching system. When someone sends a voice message, they receive an audio response back. This maintains the natural flow of voice-based conversations.

The "From audio to audio?" decision node checks whether the original input was a voice message. If so, the AI's text response is converted to speech using OpenAI's tts-1 model. The workflow also includes a fix for WhatsApp's audio requirements, changing the MIME type from 'audio/mp3' to 'audio/mpeg' so the file is delivered correctly.
For all other input types, responses are sent as text messages, keeping the interaction appropriate to the original format.
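The format-matching decision and the MIME type correction described in this step can be sketched as two small helpers. The function names and the shape of the returned dictionaries are hypothetical; only the 'audio/mp3' to 'audio/mpeg' mapping and the audio-in, audio-out rule come from the workflow description.

```python
def normalize_audio_mime(mime_type: str) -> str:
    """Map 'audio/mp3' to 'audio/mpeg'; leave every other type unchanged."""
    if mime_type.strip().lower() == "audio/mp3":
        return "audio/mpeg"
    return mime_type


def choose_reply(input_kind: str, reply_text: str) -> dict:
    """Pick the reply format based on how the user wrote in.

    Voice input gets a spoken reply (the text would be passed to a
    text-to-speech step); everything else gets a plain text reply.
    """
    if input_kind == "audio":
        return {"type": "audio", "text_to_speak": reply_text}
    return {"type": "text", "body": reply_text}
```

In this sketch, `choose_reply("audio", "Sure!")` selects the audio branch, and `normalize_audio_mime("audio/mp3")` returns `"audio/mpeg"` before the generated file is uploaded.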
Step 5: Handle Edge Cases and Error Management
Professional workflows anticipate problems, and this one includes several safety nets. The PDF validation ensures users understand format limitations. The "Not supported" path handles unexpected content types gracefully.
The audio processing includes MIME type corrections, preventing delivery failures. Each media download step includes proper authentication headers, ensuring reliable access to WhatsApp's media servers.
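The two-step media download (look up the media URL, then fetch the file with the same credentials) can be sketched with Python's standard library. This is an illustration under assumptions rather than a copy of the workflow's HTTP Request nodes: the helper names are hypothetical, and the `api_version` default is a placeholder that should be replaced with whichever Graph API version your app uses.

```python
import json
import urllib.request

GRAPH_BASE = "https://graph.facebook.com"


def media_lookup_request(media_id: str, token: str,
                         api_version: str = "v20.0") -> urllib.request.Request:
    """Build the request that asks the API for a media item's download URL."""
    return urllib.request.Request(
        f"{GRAPH_BASE}/{api_version}/{media_id}",
        headers={"Authorization": f"Bearer {token}"},
    )


def media_download_request(lookup_body: bytes,
                           token: str) -> urllib.request.Request:
    """Build the download request from the lookup response body.

    The lookup response is JSON containing a "url" field; the download
    must carry the same bearer token or the media server refuses it.
    """
    url = json.loads(lookup_body)["url"]
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )
```

Each returned object can be passed to `urllib.request.urlopen` to perform the call. Building the requests separately from sending them keeps the authentication logic easy to inspect.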
Step 6: Deploy and Monitor Your AI Assistant
Once configured, this workflow operates automatically, processing messages in real-time. The memory system ensures each user has their own conversation context, making it suitable for multiple users simultaneously.
The workflow's error handling (PDF validation, the "Not supported" path, and the audio MIME type fix) gives users clear feedback when something cannot be processed, which makes for a more professional experience.


Powerful Use Cases and Benefits
This WhatsApp AI chatbot transforms how you handle digital communication across multiple scenarios:
- Business Customer Support: Handle customer inquiries, process document uploads, and provide instant responses to images of products or issues
- Personal Productivity: Transcribe voice memos, analyze screenshots, summarize PDF documents, and maintain intelligent conversations
- Educational Applications: Students can send images of problems for analysis, upload PDFs for summaries, and interact through voice for accessibility
- Content Creation: Analyze images for social media descriptions, transcribe interviews, and process research documents
- Accessibility Support: Convert voice messages to text, describe images for visually impaired users, and provide audio responses for hands-free interaction
The multi-modal approach means one chatbot handles diverse communication needs, eliminating the need for separate tools and services. The conversation memory creates personalized experiences, while the automatic format matching makes interactions feel natural and intuitive.
This workflow shows how AI can bridge different content types and communication preferences in a single conversation. By implementing it, you're not just building a chatbot; you're creating a digital assistant that understands and responds to text, voice, images, and documents.