feat: Store MCP documentation lookups as searchable documents (Context7, etc.) #111

Closed
opened 2026-01-22 22:54:07 +00:00 by jack · 1 comment
Owner

Problem

When Claude uses MCP servers like Context7 to look up documentation, the results are valuable knowledge that could be reused:

  • "How to use React useEffect" → Context7 returns detailed docs
  • Currently: Stored only as an observation (compressed summary)
  • Better: Store the full documentation as a searchable document

This creates a personal documentation cache that grows over time, reducing redundant API calls and improving context recall.

Proposed Solution

Dual Storage for Documentation MCP Tools

When certain MCP tools return structured documentation, store them in two places:

  1. Observation (compressed) - For session timeline
  2. Document (full content) - For semantic search and reuse

Target MCP Tools

Documentation Servers:

  • mcp__context7__query-docs - Library/framework documentation
  • mcp__context7__resolve-library-id - Library search results
  • Future: Stack Overflow MCP, MDN docs, etc.

Other Knowledge Sources:

  • mcp__playwright__browser_snapshot - Captured web content
  • WebFetch results (if structured/valuable)
  • Custom documentation MCPs

Implementation Strategy

1. Detect Documentation Tools

const DOCUMENTATION_TOOLS = new Set([
  'mcp__context7__query-docs',
  'mcp__context7__resolve-library-id',
  'WebFetch',  // Conditional on content type
  'mcp__playwright__browser_snapshot'
]);

function isDocumentationTool(toolName: string): boolean {
  return DOCUMENTATION_TOOLS.has(toolName);
}

2. Extract Structured Content

async function extractDocumentation(toolName: string, toolResponse: any) {
  switch (toolName) {
    case 'mcp__context7__query-docs':
      return {
        type: 'library-docs',
        library: toolResponse.library_id,  // e.g., "/vercel/next.js"
        query: toolResponse.query,
        content: toolResponse.docs,        // Full documentation text
        code_snippets: toolResponse.code_examples,
        metadata: {
          language: detectLanguage(toolResponse),
          framework: extractFramework(toolResponse.library_id)
        }
      };
      
    case 'mcp__playwright__browser_snapshot':
      return {
        type: 'web-content',
        url: toolResponse.url,
        content: toolResponse.markdown,
        title: toolResponse.title,
        metadata: {
          captured_at: Date.now()
        }
      };
      
    // ... other tools
  }
}

3. Store as Document in Qdrant

async function storeDocumentation(doc: DocumentationExtract, sessionId: string) {
  const vectorStore = getQdrantClient();
  
  await vectorStore.upsert('documents', {
    id: generateDocId(),
    vector: await embedText(doc.content),
    payload: {
      type: doc.type,
      content: doc.content,
      title: doc.title || generateTitle(doc),
      source: doc.library || doc.url,
      session_id: sessionId,
      created_at: Date.now(),
      metadata: doc.metadata,
      
      // Cross-reference to observation
      observation_id: doc.observation_id,
      
      // Searchable fields
      tags: extractTags(doc),
      concepts: extractConcepts(doc)
    }
  });
}

4. Deduplication Strategy

Avoid storing duplicate documentation:

async function shouldStoreDocument(doc: DocumentationExtract): Promise<boolean> {
  // Check if similar content already exists
  const existing = await vectorStore.search('documents', {
    vector: await embedText(doc.content),
    filter: {
      type: doc.type,
      source: doc.source
    },
    limit: 1,
    score_threshold: 0.95  // 95% similarity
  });
  
  if (existing.length > 0) {
    // Update metadata (e.g., "last accessed") instead of creating duplicate
    await vectorStore.updateMetadata(existing[0].id, {
      last_accessed: Date.now(),
      access_count: existing[0].payload.access_count + 1
    });
    return false;
  }
  
  return true;
}

Benefits

1. Documentation Cache

User: "How do I use React useEffect?"
Claude: [Searches Qdrant documents]
Found: Context7 docs from session #123 (2 weeks ago)
Response: Based on previous lookup... [full docs available]

2. Context Enrichment

User: "Fix this Next.js bug"
Claude: [Searches both observations AND documents]
- Observation: Used Next.js routing last week
- Document: Full Next.js 14 routing docs from Context7

3. Reduced API Calls

  • First lookup: Fetches from Context7 → Stores as document
  • Future lookups: Returns from Qdrant (instant, no API cost)
  • Updates only when content is stale

4. Cross-Session Learning

Session A: Looked up Playwright browser automation
Session B: "How did I automate that login flow?"
  → Finds: Playwright docs + browser snapshot from Session A

UI Integration

Documents View (WebUI)

New section in web viewer:

📚 Documentation Library
├─ React (5 documents)
│  ├─ useEffect hook (accessed 3x, last: 2 days ago)
│  ├─ useState patterns (accessed 1x, last: 1 week ago)
│  └─ ...
├─ Next.js (8 documents)
├─ Playwright (2 documents)
└─ Web Content (12 captures)

Search Enhancement

// Search across both observations AND documents
const results = await Promise.all([
  searchObservations(query),
  searchDocuments(query)
]);

// Merge and rank by relevance
return rankByRelevance([...results]);

MCP Search Tool Enhancement

// Current: Only searches observations
mcp__plugin_claude-mem_mcp-search__search(query)

// Enhanced: Search both observations and documents
mcp__plugin_claude-mem_mcp-search__search(query, {
  include_documents: true,
  document_types: ['library-docs', 'web-content']
})

Storage Optimization

Document Lifecycle

{
  retention_policy: {
    min_access_count: 2,      // Keep if accessed 2+ times
    max_age_days: 90,          // Delete if older than 90 days AND never reused
    auto_archive: true         // Archive rarely-used docs to cold storage
  }
}

Compression

  • Large documents (>50KB): Store compressed in DB
  • Embeddings: Full precision for documents (more important than observations)
  • Code snippets: Syntax-highlighted and indexed separately
  • #110 Capture all MCP tool usage
  • Vector store enhancement for mixed content types
  • MCP search tool API expansion

Open Questions

  1. Should we auto-expire documents that are never accessed again?
  2. How to handle documentation updates (e.g., new Next.js version)?
  3. Should users manually curate their documentation library?
  4. Privacy considerations for web content captures?
  5. Should we support manual document uploads (e.g., PDF docs, custom notes)?

Example Schema

interface DocumentRecord {
  id: string;
  type: 'library-docs' | 'web-content' | 'api-reference' | 'custom';
  title: string;
  content: string;              // Full text
  source: string;               // Library ID, URL, or file path
  
  // Embeddings
  vector: number[];
  
  // Metadata
  session_id: string;           // Where it was first captured
  observation_id?: string;      // Cross-reference
  created_at: number;
  last_accessed: number;
  access_count: number;
  
  // Structured data
  code_snippets?: CodeSnippet[];
  metadata: {
    language?: string;
    framework?: string;
    version?: string;
    url?: string;
  };
  
  // Search optimization
  tags: string[];
  concepts: string[];
}

This creates a knowledge graph where:

  • Observations = "What I did and learned"
  • Documents = "Reference material I need frequently"
## Problem When Claude uses MCP servers like Context7 to look up documentation, the results are valuable knowledge that could be reused: - "How to use React useEffect" → Context7 returns detailed docs - Currently: Stored only as an observation (compressed summary) - Better: Store the **full documentation** as a searchable document This creates a **personal documentation cache** that grows over time, reducing redundant API calls and improving context recall. ## Proposed Solution ### Dual Storage for Documentation MCP Tools When certain MCP tools return structured documentation, store them in **two** places: 1. **Observation** (compressed) - For session timeline 2. **Document** (full content) - For semantic search and reuse ### Target MCP Tools **Documentation Servers:** - `mcp__context7__query-docs` - Library/framework documentation - `mcp__context7__resolve-library-id` - Library search results - Future: Stack Overflow MCP, MDN docs, etc. **Other Knowledge Sources:** - `mcp__playwright__browser_snapshot` - Captured web content - `WebFetch` results (if structured/valuable) - Custom documentation MCPs ### Implementation Strategy #### 1. Detect Documentation Tools ```typescript const DOCUMENTATION_TOOLS = new Set([ 'mcp__context7__query-docs', 'mcp__context7__resolve-library-id', 'WebFetch', // Conditional on content type 'mcp__playwright__browser_snapshot' ]); function isDocumentationTool(toolName: string): boolean { return DOCUMENTATION_TOOLS.has(toolName); } ``` #### 2. Extract Structured Content ```typescript async function extractDocumentation(toolName: string, toolResponse: any) { switch (toolName) { case 'mcp__context7__query-docs': return { type: 'library-docs', library: toolResponse.library_id, // e.g., "/vercel/next.js" query: toolResponse.query, content: toolResponse.docs, // Full documentation text code_snippets: toolResponse.code_examples, metadata: { language: detectLanguage(toolResponse), framework: extractFramework(toolResponse.library_id) } }; case 'mcp__playwright__browser_snapshot': return { type: 'web-content', url: toolResponse.url, content: toolResponse.markdown, title: toolResponse.title, metadata: { captured_at: Date.now() } }; // ... other tools } } ``` #### 3. Store as Document in Qdrant ```typescript async function storeDocumentation(doc: DocumentationExtract, sessionId: string) { const vectorStore = getQdrantClient(); await vectorStore.upsert('documents', { id: generateDocId(), vector: await embedText(doc.content), payload: { type: doc.type, content: doc.content, title: doc.title || generateTitle(doc), source: doc.library || doc.url, session_id: sessionId, created_at: Date.now(), metadata: doc.metadata, // Cross-reference to observation observation_id: doc.observation_id, // Searchable fields tags: extractTags(doc), concepts: extractConcepts(doc) } }); } ``` #### 4. Deduplication Strategy Avoid storing duplicate documentation: ```typescript async function shouldStoreDocument(doc: DocumentationExtract): Promise<boolean> { // Check if similar content already exists const existing = await vectorStore.search('documents', { vector: await embedText(doc.content), filter: { type: doc.type, source: doc.source }, limit: 1, score_threshold: 0.95 // 95% similarity }); if (existing.length > 0) { // Update metadata (e.g., "last accessed") instead of creating duplicate await vectorStore.updateMetadata(existing[0].id, { last_accessed: Date.now(), access_count: existing[0].payload.access_count + 1 }); return false; } return true; } ``` ### Benefits #### 1. Documentation Cache ``` User: "How do I use React useEffect?" Claude: [Searches Qdrant documents] Found: Context7 docs from session #123 (2 weeks ago) Response: Based on previous lookup... [full docs available] ``` #### 2. Context Enrichment ``` User: "Fix this Next.js bug" Claude: [Searches both observations AND documents] - Observation: Used Next.js routing last week - Document: Full Next.js 14 routing docs from Context7 ``` #### 3. Reduced API Calls - First lookup: Fetches from Context7 → Stores as document - Future lookups: Returns from Qdrant (instant, no API cost) - Updates only when content is stale #### 4. Cross-Session Learning ``` Session A: Looked up Playwright browser automation Session B: "How did I automate that login flow?" → Finds: Playwright docs + browser snapshot from Session A ``` ### UI Integration #### Documents View (WebUI) New section in web viewer: ``` 📚 Documentation Library ├─ React (5 documents) │ ├─ useEffect hook (accessed 3x, last: 2 days ago) │ ├─ useState patterns (accessed 1x, last: 1 week ago) │ └─ ... ├─ Next.js (8 documents) ├─ Playwright (2 documents) └─ Web Content (12 captures) ``` #### Search Enhancement ```typescript // Search across both observations AND documents const results = await Promise.all([ searchObservations(query), searchDocuments(query) ]); // Merge and rank by relevance return rankByRelevance([...results]); ``` ### MCP Search Tool Enhancement ```typescript // Current: Only searches observations mcp__plugin_claude-mem_mcp-search__search(query) // Enhanced: Search both observations and documents mcp__plugin_claude-mem_mcp-search__search(query, { include_documents: true, document_types: ['library-docs', 'web-content'] }) ``` ### Storage Optimization #### Document Lifecycle ```typescript { retention_policy: { min_access_count: 2, // Keep if accessed 2+ times max_age_days: 90, // Delete if older than 90 days AND never reused auto_archive: true // Archive rarely-used docs to cold storage } } ``` #### Compression - Large documents (>50KB): Store compressed in DB - Embeddings: Full precision for documents (more important than observations) - Code snippets: Syntax-highlighted and indexed separately ### Related Issues - #110 Capture all MCP tool usage - Vector store enhancement for mixed content types - MCP search tool API expansion ### Open Questions 1. Should we auto-expire documents that are never accessed again? 2. How to handle documentation updates (e.g., new Next.js version)? 3. Should users manually curate their documentation library? 4. Privacy considerations for web content captures? 5. Should we support manual document uploads (e.g., PDF docs, custom notes)? ### Example Schema ```typescript interface DocumentRecord { id: string; type: 'library-docs' | 'web-content' | 'api-reference' | 'custom'; title: string; content: string; // Full text source: string; // Library ID, URL, or file path // Embeddings vector: number[]; // Metadata session_id: string; // Where it was first captured observation_id?: string; // Cross-reference created_at: number; last_accessed: number; access_count: number; // Structured data code_snippets?: CodeSnippet[]; metadata: { language?: string; framework?: string; version?: string; url?: string; }; // Search optimization tags: string[]; concepts: string[]; } ``` This creates a **knowledge graph** where: - Observations = "What I did and learned" - Documents = "Reference material I need frequently"
Author
Owner

Implementation Summary

Implemented Features

Database Layer:

  • New documents table with FTS5 full-text search
  • DocumentRecord type with fields: id, project, source, source_tool, title, content, content_hash, type, metadata, memory_session_id, observation_id, access_count, last_accessed_epoch
  • DocumentType enum: library-docs, web-content, api-reference, code-example, tutorial, custom
  • IDocumentRepository interface with full CRUD + deduplication support
  • SQLiteDocumentRepository implementation

Backend API:

  • GET /api/data/documents - List with filters (project, source, sourceTool, type)
  • GET /api/data/documents/:id - Get single document (also records access)
  • GET /api/data/documents/search?q=query - Full-text search
  • DELETE /api/data/documents/:id - Delete document

Automatic Document Capture:
Extended TaskDispatcher to detect and store documentation from:

  • mcp__context7__query-docslibrary-docs type
  • mcp__context7__resolve-library-idlibrary-docs type
  • WebFetchweb-content type

Deduplication:

  • SHA256 content hash to prevent duplicate storage
  • If same content already exists, just increments access_count and updates last_accessed_epoch

Not Yet Implemented

  • UI view for browsing/searching documents (separate issue)
  • Qdrant vector indexing for semantic search
  • Document cleanup/retention policy
  • MCP search tool enhancement to include documents

Files Changed

  • packages/database/src/migrations/schema.ts - Documents table schema
  • packages/database/src/migrations/runner.ts - Migration v4
  • packages/database/src/sqlite/repositories/documents.ts - New repository
  • packages/types/src/database.ts - DocumentRecord, DocumentType
  • packages/types/src/repository.ts - IDocumentRepository interface
  • packages/backend/src/routes/data.ts - Document endpoints
  • packages/backend/src/websocket/task-dispatcher.ts - Auto-capture logic
## Implementation Summary ### Implemented Features **Database Layer:** - New `documents` table with FTS5 full-text search - `DocumentRecord` type with fields: id, project, source, source_tool, title, content, content_hash, type, metadata, memory_session_id, observation_id, access_count, last_accessed_epoch - `DocumentType` enum: `library-docs`, `web-content`, `api-reference`, `code-example`, `tutorial`, `custom` - `IDocumentRepository` interface with full CRUD + deduplication support - `SQLiteDocumentRepository` implementation **Backend API:** - `GET /api/data/documents` - List with filters (project, source, sourceTool, type) - `GET /api/data/documents/:id` - Get single document (also records access) - `GET /api/data/documents/search?q=query` - Full-text search - `DELETE /api/data/documents/:id` - Delete document **Automatic Document Capture:** Extended `TaskDispatcher` to detect and store documentation from: - `mcp__context7__query-docs` → `library-docs` type - `mcp__context7__resolve-library-id` → `library-docs` type - `WebFetch` → `web-content` type **Deduplication:** - SHA256 content hash to prevent duplicate storage - If same content already exists, just increments `access_count` and updates `last_accessed_epoch` ### Not Yet Implemented - UI view for browsing/searching documents (separate issue) - Qdrant vector indexing for semantic search - Document cleanup/retention policy - MCP search tool enhancement to include documents ### Files Changed - `packages/database/src/migrations/schema.ts` - Documents table schema - `packages/database/src/migrations/runner.ts` - Migration v4 - `packages/database/src/sqlite/repositories/documents.ts` - New repository - `packages/types/src/database.ts` - DocumentRecord, DocumentType - `packages/types/src/repository.ts` - IDocumentRepository interface - `packages/backend/src/routes/data.ts` - Document endpoints - `packages/backend/src/websocket/task-dispatcher.ts` - Auto-capture logic
jack closed this issue 2026-01-23 17:35:41 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
customable/claude-mem#111
No description provided.