feat: batch document import for RAG indexing #300

Closed

opened 2026-01-25 17:38:55 +00:00 by jack · 0 comments

jack (Owner) commented 2026-01-25 17:38:55 +00:00

Summary

Add a batch import feature to ingest existing documents into the memory system for semantic search and retrieval.

Use Cases

Import existing documentation into searchable memory
Index a codebase or research papers
Migrate notes/knowledge bases
Pre-populate context for new projects

Proposed Implementation

CLI Command

claude-mem import ./docs --project "my-project" --type "note"
claude-mem import ./research/*.pdf --recursive
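How the path arguments could be turned into a file list can be sketched as follows. This is a minimal illustration only: it assumes the shell has already expanded globs such as `./research/*.pdf`, that a directory argument contributes only its direct children unless `--recursive` is given, and that missing paths are skipped. The `collect_files` helper and those behaviours are hypothetical, not part of the proposal.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(paths: Iterable[str], recursive: bool = False) -> List[Path]:
    """Resolve CLI path arguments into a sorted, de-duplicated list of files.

    Hypothetical helper: a file argument is taken as-is; a directory
    contributes its direct children, or its whole subtree when `recursive`.
    """
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            found.add(path)
        elif path.is_dir():
            pattern = "**/*" if recursive else "*"
            found.update(p for p in path.glob(pattern) if p.is_file())
        # Paths that do not exist are skipped silently in this sketch.
    return sorted(found)
```

Returning a sorted list keeps the import order deterministic, which makes batch status reports easier to compare between runs.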

API Endpoint

POST /api/import
{
  "files": ["path/to/doc1.md", "path/to/doc2.md"],
  "project": "my-project",
  "type": "note",
  "options": {
    "generateEmbeddings": true,
    "generateSummary": false,  // Optional LLM summarization
    "chunkSize": 4000         // For large documents
  }
}
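The `//` annotations in the payload above are explanatory; strict JSON parsers reject comments, so a real request body would omit them. A server-side handler might validate the decoded body and fill in defaults roughly as sketched here. The class names, the `parse_import_request` function, the default of `"note"` for `type`, and the defaults for the options are assumptions taken from the example values, not a defined schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ImportOptions:
    # Defaults mirror the example payload; they are assumptions, not a spec.
    generate_embeddings: bool = True
    generate_summary: bool = False
    chunk_size: int = 4000


@dataclass
class ImportRequest:
    files: List[str]
    project: str
    doc_type: str = "note"  # carries the "type" field of the payload
    options: ImportOptions = field(default_factory=ImportOptions)


def parse_import_request(body: Dict[str, Any]) -> ImportRequest:
    """Validate a decoded JSON body and fill in default options."""
    files = body.get("files")
    if (
        not isinstance(files, list)
        or not files
        or not all(isinstance(f, str) for f in files)
    ):
        raise ValueError("'files' must be a non-empty list of paths")
    project = body.get("project")
    if not isinstance(project, str) or not project:
        raise ValueError("'project' must be a non-empty string")

    raw = body.get("options") or {}
    defaults = ImportOptions()
    options = ImportOptions(
        generate_embeddings=bool(raw.get("generateEmbeddings", defaults.generate_embeddings)),
        generate_summary=bool(raw.get("generateSummary", defaults.generate_summary)),
        chunk_size=int(raw.get("chunkSize", defaults.chunk_size)),
    )
    if options.chunk_size <= 0:
        raise ValueError("'chunkSize' must be positive")

    return ImportRequest(
        files=files,
        project=project,
        doc_type=body.get("type", "note"),
        options=options,
    )
```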

Supported Formats

Markdown (.md)
Plain text (.txt)
Code files (with syntax detection)
PDF (text extraction)
JSON/YAML (structured data)
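Format selection could be as simple as a lookup on the file extension, as in the sketch below. The table, the format labels, and the particular code-file extensions are illustrative assumptions; the proposal does not say how syntax detection for code files would work.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table; the code-file entries are examples only.
FORMAT_BY_SUFFIX = {
    ".md": "markdown",
    ".txt": "text",
    ".pdf": "pdf",
    ".json": "json",
    ".yaml": "yaml",
    ".yml": "yaml",
    ".py": "code",
    ".ts": "code",
    ".js": "code",
    ".go": "code",
    ".rs": "code",
}


def detect_format(path: str) -> Optional[str]:
    """Return a format label for `path`, or None if the suffix is unsupported."""
    return FORMAT_BY_SUFFIX.get(Path(path).suffix.lower())
```

Returning `None` for unknown suffixes lets the importer report skipped files in the batch status instead of failing the whole batch.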

Features

1. Direct Import (no LLM required)
   - Store raw content as observations
   - Generate embeddings via configured provider
   - Fast and cost-effective
2. LLM-Enhanced Import (optional)
   - Generate summaries
   - Extract concepts/facts
   - Auto-tagging
3. Progress Tracking
   - Queue-based processing
   - Progress API endpoint
   - Batch status reporting
4. Chunking
   - Large documents split into searchable chunks
   - Configurable chunk size and overlap
   - Maintain document references
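The chunking step can be sketched as a sliding window with overlap. The sketch assumes that chunk size is measured in characters (the proposal does not say whether `chunkSize` means characters or tokens) and uses a made-up default overlap of 200; the `chunk_text` name and the fields of each chunk record are hypothetical.

```python
from typing import Dict, List


def chunk_text(
    text: str,
    source: str,
    chunk_size: int = 4000,
    overlap: int = 200,
) -> List[Dict[str, object]]:
    """Split `text` into character windows of `chunk_size` sharing `overlap`.

    Each chunk records its source path, index, and start offset so that a
    search hit can be traced back to the original document.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[Dict[str, object]] = []
    start = 0
    while start < len(text):
        chunks.append(
            {
                "source": source,
                "index": len(chunks),
                "start": start,
                "text": text[start:start + chunk_size],
            }
        )
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the document
        start += step
    return chunks
```

For example, `chunk_text("abcdefghij", "doc.md", chunk_size=4, overlap=1)` yields the texts `abcd`, `defg`, and `ghij`, with each chunk repeating the last character of the previous one.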

Difference from Ragtime

The upstream ragtime tool uses Claude as the orchestrator, so every file goes through Claude. This feature should instead:

Work without LLM calls by default (embedding only)
Be provider-agnostic (use the configured embedding provider)
Support LLM enhancement as an optional feature
Be integrated into the CLI and API (not a standalone script)
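The requirements above could translate into a pipeline in which the embedder is an injected callable and the LLM step is an optional one, roughly as sketched below. The `Embedder` and `Enhancer` signatures, the `Observation` record, and `import_texts` are all hypothetical names used for illustration; they do not describe the existing embedding infrastructure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

# Any provider can be plugged in as long as it maps texts to vectors.
Embedder = Callable[[Sequence[str]], List[List[float]]]
# Optional LLM step returning extra metadata (summary, tags, ...).
Enhancer = Callable[[str], Dict[str, object]]


@dataclass
class Observation:
    text: str
    embedding: List[float]
    metadata: Dict[str, object] = field(default_factory=dict)


def import_texts(
    texts: Sequence[str],
    embed: Embedder,
    enhance: Optional[Enhancer] = None,
) -> List[Observation]:
    """Embed every text; call the LLM enhancer only when one is supplied."""
    vectors = embed(texts)
    if len(vectors) != len(texts):
        raise ValueError("embedder must return one vector per text")

    observations = []
    for text, vector in zip(texts, vectors):
        metadata = enhance(text) if enhance is not None else {}
        observations.append(Observation(text=text, embedding=vector, metadata=metadata))
    return observations
```

With `enhance` left at its default of `None`, the import makes no LLM calls at all, which matches the "embedding only by default" requirement.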

Related

Inspired by ragtime (https://github.com/thedotmack/claude-mem/tree/main/ragtime) from upstream
Uses existing embedding infrastructure from Issue #112