Building a Website Browser Tool: Core Architecture and Design Patterns
This post explores the key architecture and patterns for building a website browsing tool that can programmatically access and extract content from multiple internal websites.
Overview
The browser tool provides programmatic access to multiple websites and services with:
- Multi-platform support: Wiki pages, code repositories, task management, security dashboards
- Concurrent processing: Batch processing with configurable limits
- Authentication handling: Automatic SSO integration
- Content extraction: HTML to markdown conversion
Core Architecture: Strategy Pattern
Each website type has its own processing strategy:
```typescript
export class UrlProcessor {
  public async processSingleUrl(url: string): Promise<any> {
    const matcher = matchers.find((m) => m.condition(url));
    if (!matcher) {
      throw new Error(`Unrecognized URL format: ${url}`);
    }
    return await this.processWithStrategy(matcher, url);
  }
}
```
Strategy Registration
Strategies self-register with URL patterns:
```typescript
export class DocumentationStrategy {
  static readonly toolRegistration = {
    condition: (input: string): boolean =>
      input.startsWith("https://docs.company.com"),
    process: async (input: string) => {
      const strategy = new DocumentationStrategy();
      return await strategy.execute(input);
    },
  };
}
```
Common strategy types:
- `WikiStrategy`: Internal wikis
- `CodeRepositoryStrategy`: Code repositories
- `TaskManagementStrategy`: Project management
- `CollaborativeDocsStrategy`: Shared documents
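Assuming each of these strategies exposes a `toolRegistration` shaped like the one above, the shared `matchers` array can be assembled in registration order. The following is a minimal, self-contained sketch; the hostnames and strategy bodies are illustrative stand-ins, not the real implementations:

```typescript
// Sketch: each strategy exposes a toolRegistration with a URL condition
// and an async process function. Hostnames here are illustrative.
interface ToolRegistration {
  condition: (input: string) => boolean;
  process: (input: string) => Promise<string>;
}

const WikiStrategy: { toolRegistration: ToolRegistration } = {
  toolRegistration: {
    condition: (input) => input.startsWith("https://wiki.company.com"),
    process: async (input) => `wiki content for ${input}`,
  },
};

const CodeRepositoryStrategy: { toolRegistration: ToolRegistration } = {
  toolRegistration: {
    condition: (input) => input.startsWith("https://code.company.com"),
    process: async (input) => `repository content for ${input}`,
  },
};

// Registration order matters: the first matching strategy wins.
const matchers: readonly ToolRegistration[] = [
  WikiStrategy.toolRegistration,
  CodeRepositoryStrategy.toolRegistration,
] as const;

function findMatcher(url: string): ToolRegistration | undefined {
  return matchers.find((m) => m.condition(url));
}
```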
Browser Automation with Puppeteer
For JavaScript-heavy websites, the tool uses Puppeteer for browser automation:
Browser Setup
```typescript
const browser = await puppeteer.launch({
  headless: true,
  executablePath: installedBrowser.executablePath,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
```
Cookie Management
Automatic authentication cookie handling across domains:
```typescript
const relevantDomains = [
  url.hostname,
  "auth.company.com",
  "sso.company.com",
];

// Transfer authentication cookies to the browser
for (const domain of relevantDomains) {
  const cookies = cookieJar.getCookiesForHostname(domain);
  if (cookies) {
    await page.setCookie(...cookies);
  }
}
```
Authentication System
Enterprise SSO integration with automatic cookie management:
```typescript
import * as os from "os";
import * as path from "path";

export class AuthenticationClient {
  private cookieFilePath: string;
  private cookies: Map<string, string> = new Map();

  private constructor() {
    this.cookieFilePath = path.join(os.homedir(), ".auth", "cookie");
    this.loadAuthCookies();
  }
}
```
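The cookie file format is not shown in the source. Assuming a simple one-`name=value`-per-line layout, `loadAuthCookies` could parse the file along these lines; this is a sketch under that assumption, not the actual implementation:

```typescript
// Sketch of cookie-file parsing, assuming one "name=value" pair per line.
// The real file format used by AuthenticationClient may differ.
function parseCookieFile(contents: string): Map<string, string> {
  const cookies = new Map<string, string>();
  for (const line of contents.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue; // skip blanks and comments
    const eq = trimmed.indexOf("=");
    if (eq > 0) {
      cookies.set(trimmed.slice(0, eq), trimmed.slice(eq + 1));
    }
  }
  return cookies;
}
```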
Authentication Flow
```typescript
private async handleAuthentication(page: puppeteer.Page): Promise<void> {
  const isAuthPage = await page.evaluate(() => {
    return window.location.href.includes("auth.company.com");
  });
  if (isAuthPage) {
    const continueButton = await page.$('button[type="submit"]');
    if (continueButton) {
      await continueButton.click();
      await page.waitForNavigation({ timeout: 120_000 });
    }
  }
}
```
Content Processing
The tool extracts and converts web content to markdown:
```typescript
// Convert links to absolute URLs
await page.evaluate((baseUrl) => {
  const links = Array.from(document.querySelectorAll("a[href]"));
  links.forEach((link) => {
    const href = link.getAttribute("href");
    if (href) {
      const absoluteUrl = new URL(href, baseUrl).toString();
      link.setAttribute("href", absoluteUrl);
    }
  });
}, url.toString());
```
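The same normalization is available server-side via the WHATWG `URL` class, which resolves relative hrefs against a base URL:

```typescript
// Resolve a possibly-relative href against the page's base URL.
// new URL(href, base) handles absolute hrefs, root-relative paths, and fragments.
function toAbsoluteUrl(href: string, baseUrl: string): string {
  return new URL(href, baseUrl).toString();
}
```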
Content Extraction
Uses Mozilla’s Readability algorithm for clean content extraction:
```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

const dom = new JSDOM(htmlContent, { url: url.toString() });
const reader = new Readability(dom.window.document);
const article = reader.parse();

if (article && article.content) {
  const turndownService = new TurndownService();
  result.mainContent = turndownService.turndown(article.content);
}
```
Performance
Batch Processing
Supports concurrent processing of multiple URLs:
```typescript
const results = await processBatch(
  urls,
  async (url) => await urlProcessor.processSingleUrl(url),
  { concurrencyLimit: 5 },
);
```
Technology Stack
Core Technologies:
- Puppeteer: Browser automation
- JSDOM: Server-side DOM manipulation
- Turndown: HTML to Markdown conversion
- Readability: Content extraction
- Zod: Parameter validation
Adding New Websites
The system uses static registration - new websites require code changes:
1. Create Strategy
```typescript
export class NewWebsiteStrategy {
  static readonly toolRegistration = {
    condition: (input: string): boolean =>
      input.startsWith("https://newsite.company.com"),
    process: async (input: string) => {
      const strategy = new NewWebsiteStrategy();
      return await strategy.execute(input);
    },
  };
}
```
2. Register Strategy
```typescript
export const matchers = [
  NewWebsiteStrategy.toolRegistration,
  // ... other strategies
] as const;
```
3. URL Matching
The system finds the first matching strategy:
```typescript
const matcher = matchers.find((m) => m.condition(url));
if (!matcher) {
  throw new Error(`Unrecognized URL format: ${url}`);
}
```
Development Process
- Create strategy file
- Register in matchers array
- Test and deploy
Static registration trades automatic discovery for reliability: every supported site is known and type-checked at compile time.
Runtime Execution
Request Flow
- Client sends JSON-RPC request with URL
- Server validates parameters
- URL processor finds matching strategy
- Strategy executes and processes website
- Response formatted and returned
Request Format
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "read_internal_website",
    "arguments": {
      "url": "https://docs.company.com/api/documentation"
    }
  },
  "id": 1
}
```
Strategy Selection
```typescript
export class UrlProcessor {
  public async processSingleUrl(url: string): Promise<any> {
    // Find the first matching strategy
    const matcher = matchers.find((m) => m.condition(url));
    if (!matcher) {
      throw new Error(`Unrecognized URL format: ${url}`);
    }
    return await this.processWithStrategy(matcher, url);
  }
}
```
Key Points:
- Strategies checked in registration order
- First match wins
- Generic fallback available
- Clear error handling for unsupported URLs
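Because the first match wins, a generic fallback only needs a condition that accepts any URL it might plausibly handle, registered last. The real fallback strategy is not shown in the source; this sketch illustrates the idea:

```typescript
// Sketch of a catch-all fallback: the condition accepts any http(s) URL,
// so it must be registered last in the matchers array.
const GenericStrategy = {
  toolRegistration: {
    condition: (input: string): boolean => input.startsWith("http"),
    process: async (input: string) => `generic fetch of ${input}`,
  },
};

const matchersWithFallback = [
  // ...site-specific strategies go first...
  GenericStrategy.toolRegistration,
] as const;
```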
Key Design Patterns
- Strategy Pattern - Each website type has its own processing strategy
- Static Registration - Strategies registered at compile time for reliability
- Chain of Responsibility - First matching strategy handles the request
- Batch Processing - Concurrent processing with configurable limits
- Authentication Abstraction - Transparent authentication handling
- Content Normalization - Consistent markdown output format
Summary
This architecture provides a solid foundation for building website browser tools with:
- Modular design for easy extension
- Reliable authentication handling
- Performance optimization through concurrency
- Quality content extraction using proven algorithms
- Error resilience with clear fallback mechanisms
The patterns demonstrated here work well for both internal tooling and public APIs requiring robust website browsing capabilities.
URL Discovery and Documentation
What Documentation Exists
The MCP server provides tool-level documentation but not individual URL documentation:
```typescript
export const ReadInternalWebsiteTool: Tool = {
  name: "read_internal_website",
  description: [
    "Read content from internal websites.",
    "",
    "Supported website categories:",
    "- docs.company.com: Technical documentation",
    "- wiki.company.com: Internal wikis",
    "- code.company.com: Code repositories",
    "- tasks.company.com: Project management",
    // ... more categories
  ].join("\n"),
};
```
What’s NOT Available
- No individual URL documentation - No descriptions for specific URLs
- No URL discovery - Can’t query “what URLs are available?”
- No parameter examples - No guidance on URL structure
How Clients Find URLs
- Tool description - Lists supported website categories
- Trial and error - Try URLs and handle errors
- External documentation - Organization maintains URL catalogs separately
- Application logic - Clients construct URLs based on business needs
Example Discovery Process
```typescript
// Client tries a URL
const url = "https://wiki.company.com/project-status";
try {
  const result = await mcpClient.callTool("read_internal_website", { url });
  // Success: the URL pattern is supported
} catch (error) {
  if (error.message.includes("Unrecognized URL format")) {
    // URL pattern not supported
  }
}
```
Design Trade-offs
Pros:
- Simple server design
- Flexible URL handling
- No need to maintain URL catalogs
Cons:
- Clients must know URLs beforehand
- Limited discovery capabilities
- Trial-and-error approach needed
Key Point: The server handles “how to browse” while clients handle “what to browse” - this separation keeps the architecture simple and flexible.