Building a Website Browser Tool: Core Architecture and Design Patterns
This post explores the key architecture and patterns for building a website browsing tool that can programmatically access and extract content from multiple internal websites.
Overview
The browser tool provides programmatic access to multiple websites and services with:
- Multi-platform support: Wiki pages, code repositories, task management, security dashboards
- Concurrent processing: Batch processing with configurable limits
- Authentication handling: Automatic SSO integration
- Content extraction: HTML to markdown conversion
Core Architecture: Strategy Pattern
Each website type has its own processing strategy:
```typescript
export class UrlProcessor {
  public async processSingleUrl(url: string): Promise<any> {
    const matcher = matchers.find((m) => m.condition(url));
    if (!matcher) {
      throw new Error(`Unrecognized URL format: ${url}`);
    }
    return await this.processWithStrategy(matcher, url);
  }
}
```
Strategy Registration
Strategies self-register with URL patterns:
```typescript
export class DocumentationStrategy {
  static readonly toolRegistration = {
    condition: (input: string): boolean =>
      input.startsWith("https://docs.company.com"),
    process: async (input: string) => {
      const strategy = new DocumentationStrategy();
      return await strategy.execute(input);
    },
  };
}
```
Common strategy types:
- `WikiStrategy`: Internal wikis
- `CodeRepositoryStrategy`: Code repositories
- `TaskManagementStrategy`: Project management
- `CollaborativeDocsStrategy`: Shared documents
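Assuming each of these strategies exposes a `toolRegistration` shaped like the one above, the shared `matchers` array can be assembled in registration order. The following is a minimal, self-contained sketch; the hostnames and strategy bodies are illustrative stand-ins, not the real implementations:

```typescript
// Sketch: each strategy exposes a toolRegistration with a URL condition
// and an async process function. Hostnames here are illustrative.
interface ToolRegistration {
  condition: (input: string) => boolean;
  process: (input: string) => Promise<string>;
}

const WikiStrategy: { toolRegistration: ToolRegistration } = {
  toolRegistration: {
    condition: (input) => input.startsWith("https://wiki.company.com"),
    process: async (input) => `wiki content for ${input}`,
  },
};

const CodeRepositoryStrategy: { toolRegistration: ToolRegistration } = {
  toolRegistration: {
    condition: (input) => input.startsWith("https://code.company.com"),
    process: async (input) => `repository content for ${input}`,
  },
};

// Registration order matters: the first matching strategy wins.
const matchers: readonly ToolRegistration[] = [
  WikiStrategy.toolRegistration,
  CodeRepositoryStrategy.toolRegistration,
] as const;

function findMatcher(url: string): ToolRegistration | undefined {
  return matchers.find((m) => m.condition(url));
}
```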
Browser Automation with Puppeteer
For JavaScript-heavy websites, the tool uses Puppeteer for browser automation:
Browser Setup
```typescript
const browser = await puppeteer.launch({
  headless: true,
  executablePath: installedBrowser.executablePath,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
```
Cookie Management
Automatic authentication cookie handling across domains:
```typescript
const relevantDomains = [
  url.hostname,
  "auth.company.com",
  "sso.company.com",
];

// Transfer authentication cookies to the browser
for (const domain of relevantDomains) {
  const cookies = cookieJar.getCookiesForHostname(domain);
  if (cookies) {
    await page.setCookie(...cookies);
  }
}
```
Authentication System
Enterprise SSO integration with automatic cookie management:
```typescript
import * as os from "os";
import * as path from "path";

export class AuthenticationClient {
  private cookieFilePath: string;
  private cookies: Map<string, string> = new Map();

  private constructor() {
    this.cookieFilePath = path.join(os.homedir(), ".auth", "cookie");
    this.loadAuthCookies();
  }
}
```
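The cookie file format is not shown in the source. Assuming a simple one-`name=value`-per-line layout, `loadAuthCookies` could parse the file along these lines; this is a sketch under that assumption, not the actual implementation:

```typescript
// Sketch of cookie-file parsing, assuming one "name=value" pair per line.
// The real file format used by AuthenticationClient may differ.
function parseCookieFile(contents: string): Map<string, string> {
  const cookies = new Map<string, string>();
  for (const line of contents.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue; // skip blanks and comments
    const eq = trimmed.indexOf("=");
    if (eq > 0) {
      cookies.set(trimmed.slice(0, eq), trimmed.slice(eq + 1));
    }
  }
  return cookies;
}
```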
Authentication Flow
```typescript
private async handleAuthentication(page: puppeteer.Page): Promise<void> {
  const isAuthPage = await page.evaluate(() => {
    return window.location.href.includes("auth.company.com");
  });
  if (isAuthPage) {
    const continueButton = await page.$('button[type="submit"]');
    if (continueButton) {
      await continueButton.click();
      await page.waitForNavigation({ timeout: 120_000 });
    }
  }
}
```
Content Processing
The tool extracts and converts web content to markdown:
```typescript
// Convert links to absolute URLs
await page.evaluate((baseUrl) => {
  const links = Array.from(document.querySelectorAll("a[href]"));
  links.forEach((link) => {
    const href = link.getAttribute("href");
    if (href) {
      const absoluteUrl = new URL(href, baseUrl).toString();
      link.setAttribute("href", absoluteUrl);
    }
  });
}, url.toString());
```
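The same normalization is available server-side via the WHATWG `URL` class, which resolves relative hrefs against a base URL:

```typescript
// Resolve a possibly-relative href against the page's base URL.
// new URL(href, base) handles absolute hrefs, root-relative paths, and fragments.
function toAbsoluteUrl(href: string, baseUrl: string): string {
  return new URL(href, baseUrl).toString();
}
```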
Content Extraction
Uses Mozilla’s Readability algorithm for clean content extraction:
```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

const dom = new JSDOM(htmlContent, { url: url.toString() });
const reader = new Readability(dom.window.document);
const article = reader.parse();

if (article && article.content) {
  const turndownService = new TurndownService();
  result.mainContent = turndownService.turndown(article.content);
}
```
Performance
Batch Processing
Supports concurrent processing of multiple URLs:
```typescript
const results = await processBatch(
  urls,
  async (url) => await urlProcessor.processSingleUrl(url),
  { concurrencyLimit: 5 },
);
```
Technology Stack
Core Technologies:
- Puppeteer: Browser automation
- JSDOM: Server-side DOM manipulation
- Turndown: HTML to Markdown conversion
- Readability: Content extraction
- Zod: Parameter validation
Adding New Websites
The system uses static registration - new websites require code changes:
1. Create Strategy
```typescript
export class NewWebsiteStrategy {
  static readonly toolRegistration = {
    condition: (input: string): boolean =>
      input.startsWith("https://newsite.company.com"),
    process: async (input: string) => {
      const strategy = new NewWebsiteStrategy();
      return await strategy.execute(input);
    },
  };
}
```
2. Register Strategy
```typescript
export const matchers = [
  NewWebsiteStrategy.toolRegistration,
  // ... other strategies
] as const;
```
3. URL Matching
The system finds the first matching strategy:
```typescript
const matcher = matchers.find((m) => m.condition(url));
if (!matcher) {
  throw new Error(`Unrecognized URL format: ${url}`);
}
```
Development Process
- Create strategy file
- Register in matchers array
- Test and deploy
Static registration trades automatic discovery for reliability: every supported site is known and type-checked at compile time.
Runtime Execution
Request Flow
- Client sends JSON-RPC request with URL
- Server validates parameters
- URL processor finds matching strategy
- Strategy executes and processes website
- Response formatted and returned
Request Format
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "read_internal_website",
    "arguments": {
      "url": "https://docs.company.com/api/documentation"
    }
  },
  "id": 1
}
```
Strategy Selection
```typescript
export class UrlProcessor {
  public async processSingleUrl(url: string): Promise<any> {
    // Find the first matching strategy
    const matcher = matchers.find((m) => m.condition(url));
    if (!matcher) {
      throw new Error(`Unrecognized URL format: ${url}`);
    }
    return await this.processWithStrategy(matcher, url);
  }
}
```
Key Points:
- Strategies checked in registration order
- First match wins
- Generic fallback available
- Clear error handling for unsupported URLs
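Because the first match wins, a generic fallback only needs a condition that accepts any URL it might plausibly handle, registered last. The real fallback strategy is not shown in the source; this sketch illustrates the idea:

```typescript
// Sketch of a catch-all fallback: the condition accepts any http(s) URL,
// so it must be registered last in the matchers array.
const GenericStrategy = {
  toolRegistration: {
    condition: (input: string): boolean => input.startsWith("http"),
    process: async (input: string) => `generic fetch of ${input}`,
  },
};

const matchersWithFallback = [
  // ...site-specific strategies go first...
  GenericStrategy.toolRegistration,
] as const;
```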
Key Design Patterns
- Strategy Pattern - Each website type has its own processing strategy
- Static Registration - Strategies registered at compile time for reliability
- Chain of Responsibility - First matching strategy handles the request
- Batch Processing - Concurrent processing with configurable limits
- Authentication Abstraction - Transparent authentication handling
- Content Normalization - Consistent markdown output format
Summary
This architecture provides a solid foundation for building website browser tools with:
- Modular design for easy extension
- Reliable authentication handling
- Performance optimization through concurrency
- Quality content extraction using proven algorithms
- Error resilience with clear fallback mechanisms
The patterns demonstrated here work well for both internal tooling and public APIs requiring robust website browsing capabilities.
URL Discovery and Documentation
What Documentation Exists
The MCP server provides tool-level documentation but not individual URL documentation:
```typescript
export const ReadInternalWebsiteTool: Tool = {
  name: "read_internal_website",
  description: [
    "Read content from internal websites.",
    "",
    "Supported website categories:",
    "- docs.company.com: Technical documentation",
    "- wiki.company.com: Internal wikis",
    "- code.company.com: Code repositories",
    "- tasks.company.com: Project management",
    // ... more categories
  ].join("\n"),
};
```
What’s NOT Available
- No individual URL documentation - No descriptions for specific URLs
- No URL discovery - Can’t query “what URLs are available?”
- No parameter examples - No guidance on URL structure
How Clients Find URLs
- Tool description - Lists supported website categories
- Trial and error - Try URLs and handle errors
- External documentation - Organization maintains URL catalogs separately
- Application logic - Clients construct URLs based on business needs
Example Discovery Process
```typescript
// Client tries a URL
const url = "https://wiki.company.com/project-status";
try {
  const result = await mcpClient.callTool("read_internal_website", { url });
  // Success: the URL pattern is supported
} catch (error) {
  if (error.message.includes("Unrecognized URL format")) {
    // URL pattern not supported
  }
}
```
Design Trade-offs
Pros:
- Simple server design
- Flexible URL handling
- No need to maintain URL catalogs
Cons:
- Clients must know URLs beforehand
- Limited discovery capabilities
- Trial-and-error approach needed
Key Point: The server handles “how to browse” while clients handle “what to browse” - this separation keeps the architecture simple and flexible.