Web Action in Chat
Empowering ChatBrowse: A Roadmap for Advanced Web Interactions
ChatBrowse, our trusty Chrome extension for conversational web navigation, currently excels at basic “go to URL” commands. But what if we could empower it to do more on a webpage? This post outlines our strategy for expanding ChatBrowse’s capabilities, allowing users to perform a wider range of actions like clicking elements, typing text, and performing targeted searches, all through natural language commands.
The Vision: Conversational Web Automation
Imagine telling ChatBrowse:
- “Search Google for the latest Playwright updates.”
- “Click the ‘Login’ button.”
- “Type my username ‘testuser’ into the field labeled ‘Username’.”
- “Read the main headline from this news article.”
Achieving this requires a well-thought-out architecture that spans the extension’s UI, its background processes, and the powerful Model Context Protocol (MCP) server running Playwright.
Core Architectural Pillars
Our approach will be built on the following key components and principles:
- Defining an Action Vocabulary:
- We’ll start by defining a clear set of actions ChatBrowse will support (e.g.,
navigate
,click
,type
,read_text
,google_search
). - Each action will have specific parameters (e.g., a
click
action needs aselector
;type
needs aselector
andtext
).
- We’ll start by defining a clear set of actions ChatBrowse will support (e.g.,
- The MCP Server: Our Playwright Powerhouse (
mcp-server/src/index.ts
)- This Node.js server will remain the heart of web interaction.
- For each new action, we’ll implement a dedicated handler function (e.g.,
handleGoogleSearch
,handleClickElement
). - These handlers will use Playwright’s rich API to:
- Reliably locate elements (using CSS selectors, XPath, or Playwright’s built-in locators like
getByRole
,getByText
). - Perform interactions:
page.click()
,page.fill()
,page.press()
, etc. - Wait for page changes and elements to be ready (
page.waitForSelector()
,page.waitForLoadState()
). - Extract relevant information (page title, URL, text content) after an action.
- Reliably locate elements (using CSS selectors, XPath, or Playwright’s built-in locators like
- The server will return a standardized JSON response indicating success/failure and any extracted data.
- The Extension: Bridging User Commands to Actions
- Command Parsing (
src/background.ts
):- The background script will be enhanced to interpret more complex natural language commands.
- It will translate these commands into structured action requests for the MCP server (e.g., “google search for kittens” becomes
{ method: 'tool', params: { name: 'google_search', parameters: { query: 'kittens' } } }
).
- MCP Client (
src/mcp-client.ts
):- This client will have new methods corresponding to each supported action (e.g.,
googleSearch(query)
,clickElement(selector)
). - These methods will be responsible for sending the structured JSON requests to the native messaging host.
- This client will have new methods corresponding to each supported action (e.g.,
- Native Bridge (
mcp-server/src/native-bridge.ts
):- This script, which connects the extension to the MCP server process, will be updated to recognize and correctly route new action types to the appropriate handlers in the MCP server.
- Command Parsing (
- Element Identification: The Key Challenge
- Translating user descriptions (“the blue button”) into precise selectors Playwright can understand is non-trivial.
- Initial Approach: For commands like “google search,” we can often use stable, hardcoded (or configurable) selectors for common elements. For more generic actions, users might initially need to provide more specific selectors.
- Future Enhancements: We could explore content script helpers that allow users to visually select elements or use AI to infer selectors from natural language descriptions.
- User Experience and Feedback:
- The ChatBrowse popup UI will be updated to clearly display the results of actions or any errors encountered.
- Feedback during long-running actions will be important.
A Phased Approach: Starting with Google Search
To manage complexity, we’ll implement these capabilities incrementally. A great first step is adding a “Google Search” command:
- User Command: “google search for [query]”
- Extension Logic: Parse command, call
mcpClient.googleSearch(query)
. - MCP Server:
handleGoogleSearch
function uses Playwright to:- Navigate to
https://www.google.com
. - Locate the search input (e.g.,
textarea[name="q"]
). - Fill the query.
- Press Enter.
- Wait for results and extract relevant information.
- Navigate to
- UI Update: Display search results (or a summary) in the chat window.
The Road Ahead
By thoughtfully expanding our action vocabulary and refining the communication pipeline between the extension and the Playwright-driven MCP server, we can transform ChatBrowse into a significantly more powerful and versatile conversational web browsing assistant. This iterative approach will allow us to continuously add value and learn as we go.
Stay tuned for updates as we bring these exciting new capabilities to life!
The project is in https://github.com/JKevinXu/ChatBrowse
Comments