Empowering ChatBrowse: A Roadmap for Advanced Web Interactions

ChatBrowse, our trusty Chrome extension for conversational web navigation, currently excels at basic “go to URL” commands. But what if we could empower it to do more on a webpage? This post outlines our strategy for expanding ChatBrowse’s capabilities, allowing users to perform a wider range of actions like clicking elements, typing text, and performing targeted searches, all through natural language commands.

The Vision: Conversational Web Automation

Imagine telling ChatBrowse:

“Search Google for the latest Playwright updates.”
“Click the ‘Login’ button.”
“Type my username ‘testuser’ into the field labeled ‘Username’.”
“Read the main headline from this news article.”

Achieving this requires a well-thought-out architecture that spans the extension’s UI, its background processes, and the powerful Model Context Protocol (MCP) server running Playwright.

Core Architectural Pillars

Our approach will be built on the following key components and principles:

Defining an Action Vocabulary:
- We’ll start by defining a clear set of actions ChatBrowse will support (e.g., navigate, click, type, read_text, google_search).
- Each action will have specific parameters (e.g., a click action needs a selector; type needs a selector and text).
The MCP Server: Our Playwright Powerhouse (mcp-server/src/index.ts)
- This Node.js server will remain the heart of web interaction.
- For each new action, we’ll implement a dedicated handler function (e.g., handleGoogleSearch, handleClickElement).
- These handlers will use Playwright’s rich API to:
  - Reliably locate elements (using CSS selectors, XPath, or Playwright’s built-in locators like getByRole, getByText).
  - Perform interactions: page.click(), page.fill(), page.press(), etc.
  - Wait for page changes and elements to be ready (page.waitForSelector(), page.waitForLoadState()).
  - Extract relevant information (page title, URL, text content) after an action.
- The server will return a standardized JSON response indicating success/failure and any extracted data.
The Extension: Bridging User Commands to Actions
- Command Parsing (src/background.ts):
  - The background script will be enhanced to interpret more complex natural language commands.
  - It will translate these commands into structured action requests for the MCP server (e.g., “google search for kittens” becomes { method: 'tool', params: { name: 'google_search', parameters: { query: 'kittens' } } }).
- MCP Client (src/mcp-client.ts):
  - This client will have new methods corresponding to each supported action (e.g., googleSearch(query), clickElement(selector)).
  - These methods will be responsible for sending the structured JSON requests to the native messaging host.
- Native Bridge (mcp-server/src/native-bridge.ts):
  - This script, which connects the extension to the MCP server process, will be updated to recognize and correctly route new action types to the appropriate handlers in the MCP server.
Element Identification: The Key Challenge
- Translating user descriptions (“the blue button”) into precise selectors Playwright can understand is non-trivial.
- Initial Approach: For commands like “google search,” we can often use stable, hardcoded (or configurable) selectors for common elements. For more generic actions, users might initially need to provide more specific selectors.
- Future Enhancements: We could explore content script helpers that allow users to visually select elements or use AI to infer selectors from natural language descriptions.
User Experience and Feedback:
- The ChatBrowse popup UI will be updated to clearly display the results of actions or any errors encountered.
- Feedback during long-running actions will be important.

A Phased Approach: Starting with Google Search

To manage complexity, we’ll implement these capabilities incrementally. A great first step is adding a “Google Search” command:

User Command: “google search for [query]”
Extension Logic: Parse command, call mcpClient.googleSearch(query).
MCP Server: handleGoogleSearch function uses Playwright to:
- Navigate to https://www.google.com.
- Locate the search input (e.g., textarea[name="q"]).
- Fill the query.
- Press Enter.
- Wait for results and extract relevant information.
UI Update: Display search results (or a summary) in the chat window.

The Road Ahead

By thoughtfully expanding our action vocabulary and refining the communication pipeline between the extension and the Playwright-driven MCP server, we can transform ChatBrowse into a significantly more powerful and versatile conversational web browsing assistant. This iterative approach will allow us to continuously add value and learn as we go.

Stay tuned for updates as we bring these exciting new capabilities to life!

The project is in https://github.com/JKevinXu/ChatBrowse

Kevin Xu Blog

Web Action in Chat

Empowering ChatBrowse: A Roadmap for Advanced Web Interactions

The Vision: Conversational Web Automation

Core Architectural Pillars

A Phased Approach: Starting with Google Search

The Road Ahead

Comments