feat: batch document import for RAG indexing #300

Closed
opened 2026-01-25 17:38:55 +00:00 by jack · 0 comments

Summary

Add a batch import feature to ingest existing documents into the memory system for semantic search and retrieval.

Use Cases

  • Import existing documentation into searchable memory
  • Index a codebase or research papers
  • Migrate notes/knowledge bases
  • Pre-populate context for new projects

Proposed Implementation

CLI Command

claude-mem import ./docs --project "my-project" --type "note"
claude-mem import ./research/*.pdf --recursive

API Endpoint

POST /api/import
{
  "files": ["path/to/doc1.md", "path/to/doc2.md"],
  "project": "my-project",
  "type": "note",
  "options": {
    "generateEmbeddings": true,
    "generateSummary": false,  // Optional LLM summarization
    "chunkSize": 4000         // For large documents
  }
}
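
A minimal sketch of how a client might assemble the request body shown above. The endpoint is only proposed in this issue, so `ImportOptions` and `build_import_request` are hypothetical helper names, and the defaults simply mirror the example values rather than any confirmed behaviour. The `//` comments in the example are annotations; the serialized JSON body would not contain them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ImportOptions:
    # Field names follow the example request; defaults are assumptions.
    generateEmbeddings: bool = True
    generateSummary: bool = False  # optional LLM summarization
    chunkSize: int = 4000          # for large documents


def build_import_request(files, project, type_="note", options=None):
    """Return the JSON body for the proposed POST /api/import call."""
    if not files:
        raise ValueError("at least one file path is required")
    if options is None:
        options = ImportOptions()
    if options.chunkSize <= 0:
        raise ValueError("chunkSize must be positive")
    body = {
        "files": list(files),
        "project": project,
        "type": type_,
        "options": asdict(options),
    }
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    print(build_import_request(["path/to/doc1.md", "path/to/doc2.md"], "my-project"))
```

Validating `files` and `chunkSize` before sending keeps obviously malformed batches out of the processing queue.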

Supported Formats

  • Markdown (.md)
  • Plain text (.txt)
  • Code files (with syntax detection)
  • PDF (text extraction)
  • JSON/YAML (structured data)
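
One way the importer could decide how to treat each file is a lookup by file extension. The sketch below is illustrative only: the issue names format families, not an extension table, so `FORMAT_BY_SUFFIX`, `detect_format`, and `partition_by_support` are hypothetical, and the code-file extensions listed are just examples.

```python
from pathlib import Path

# Hypothetical mapping from extension to the format families listed above.
FORMAT_BY_SUFFIX = {
    ".md": "markdown",
    ".txt": "text",
    ".pdf": "pdf",
    ".json": "structured",
    ".yaml": "structured",
    ".yml": "structured",
    # Example code extensions; a real importer would likely cover more.
    ".py": "code",
    ".ts": "code",
    ".js": "code",
    ".go": "code",
    ".rs": "code",
}


def detect_format(path):
    """Return the format family for a path, or None if it is unsupported."""
    return FORMAT_BY_SUFFIX.get(Path(path).suffix.lower())


def partition_by_support(paths):
    """Split paths into (supported, skipped) so a batch can report both."""
    supported, skipped = [], []
    for path in paths:
        (supported if detect_format(path) else skipped).append(path)
    return supported, skipped
```

Reporting skipped files separately, rather than failing the whole batch, fits the batch status reporting described under Features.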

Features

  1. Direct Import (no LLM required)

    • Store raw content as observations
    • Generate embeddings via configured provider
    • Fast and cost-effective
  2. LLM-Enhanced Import (optional)

    • Generate summaries
    • Extract concepts/facts
    • Auto-tagging
  3. Progress Tracking

    • Queue-based processing
    • Progress API endpoint
    • Batch status reporting
  4. Chunking

    • Large documents split into searchable chunks
    • Configurable chunk size and overlap
    • Maintain document references

Difference from Ragtime

The upstream ragtime tool uses Claude as the orchestrator, so every file goes through Claude. This feature should:

  • Work without LLM calls by default (embedding only)
  • Be provider-agnostic (use the configured embedding provider)
  • Support LLM enhancement as an optional feature
  • Be integrated into the CLI and API (not a standalone script)
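
The requirements above could be met by injecting the embedding provider as a plain callable and making the LLM step an optional second callable. The sketch below is a hypothetical illustration: `import_documents`, `BatchStatus`, and the record fields are invented here and do not describe the existing embedding infrastructure.

```python
from dataclasses import dataclass, field


@dataclass
class BatchStatus:
    total: int
    done: int = 0
    failed: list = field(default_factory=list)  # (path, error message) pairs

    @property
    def percent(self):
        """Share of the batch that has been processed, successfully or not."""
        if self.total == 0:
            return 100.0
        return 100.0 * (self.done + len(self.failed)) / self.total


def import_documents(docs, embed, enhance=None):
    """Import a {path: text} mapping.

    embed:   callable text -> sequence of floats (any configured provider).
    enhance: optional callable text -> summary; when None, no LLM is called.
    """
    status = BatchStatus(total=len(docs))
    observations = []
    for path, text in docs.items():
        try:
            record = {"source": path, "content": text, "embedding": list(embed(text))}
            if enhance is not None:
                record["summary"] = enhance(text)
            observations.append(record)
            status.done += 1
        except Exception as exc:  # keep going so one bad file does not stop the batch
            status.failed.append((path, str(exc)))
    return observations, status
```

Because `enhance` defaults to `None`, the direct path never touches an LLM, and any provider can be plugged in by passing a different `embed` callable.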

Related

  • Inspired by ragtime (https://github.com/thedotmack/claude-mem/tree/main/ragtime) from upstream
  • Uses existing embedding infrastructure from Issue #112
jack closed this issue 2026-01-25 20:11:07 +00:00
Reference
customable/claude-mem#300