Overview
Knowledge Search uses Retrieval Augmented Generation (RAG) to provide intelligent, semantic search across your SharePoint document libraries. Unlike traditional keyword search, Knowledge Search understands the meaning of your questions and finds relevant information even when exact keywords don’t match.
Semantic Search
Multi-Corpus Query
Source Attribution
Auto-Sync
What is a Corpus?
A corpus is a collection of documents from a SharePoint site or document library that has been:
- Embedded (converted to vector representations)
- Indexed for semantic search
- Made searchable through AVA
Example Corpora
- Legal
- HR
- Support
- Sales
Legal example:
- Source: Legal/Contracts SharePoint library
- 350 vendor and customer contracts
- Auto-sync: ON
- Field extraction: Company Name, Contract Type, Effective Date, Value
Creating a Corpus
Navigate to Knowledge Search
Click 'Create New Corpus'

Select SharePoint Site
- Choose from dropdown of available sites
- Must have read access to the site
- Can preview site contents
Select Document Library
- Choose specific library within site
- See document count preview
- Supported types: PDF, Word, Excel, PowerPoint, text
Add Description
Configure Auto-Sync
- New documents (added automatically)
- Modified documents (re-embedded)
- Deleted documents (removed from corpus)
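The three cases above amount to comparing the current state of the library with what the corpus already holds. The following is a minimal sketch of that comparison, not AVA's actual implementation: it assumes each document can be identified by its path and a last-modified timestamp, and the names `SyncPlan` and `plan_sync` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    to_embed: list[str] = field(default_factory=list)    # new documents
    to_reembed: list[str] = field(default_factory=list)  # modified documents
    to_remove: list[str] = field(default_factory=list)   # deleted documents


def plan_sync(library: dict[str, float], corpus: dict[str, float]) -> SyncPlan:
    """Compare two snapshots that map document path -> last-modified time."""
    plan = SyncPlan()
    for path, modified in sorted(library.items()):
        if path not in corpus:
            plan.to_embed.append(path)
        elif modified > corpus[path]:
            plan.to_reembed.append(path)
    plan.to_remove = sorted(path for path in corpus if path not in library)
    return plan


# Illustrative snapshots (timestamps are arbitrary numbers)
library = {"nda.pdf": 200.0, "msa.docx": 150.0, "sow.pdf": 100.0}
corpus = {"nda.pdf": 100.0, "sow.pdf": 100.0, "old.pdf": 50.0}

plan = plan_sync(library, corpus)
print(plan.to_embed)    # ['msa.docx']
print(plan.to_reembed)  # ['nda.pdf']
print(plan.to_remove)   # ['old.pdf']
```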

Set Field Extraction (Optional)
- document_type: Contract, Policy, Guide
- department: Sales, Legal, Engineering
- date: Effective date, creation date
- author: Document creator
- company: Customer or vendor name
- keywords: Custom tags
Create & Process
- Click “Create Corpus + Add Files”
- AVA begins embedding documents
- Processing time: ~1 minute per 10 documents
- You can leave and return later
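As a rough planning aid, the "~1 minute per 10 documents" figure can be turned into an estimate. This is only a sketch: the function name is hypothetical, and real processing time will vary with file size and type.

```python
import math


def estimate_processing_minutes(document_count: int, docs_per_minute: float = 10.0) -> int:
    """Rough embedding time in whole minutes, rounded up."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    return math.ceil(document_count / docs_per_minute)


# The 350-contract example corpus from earlier in this guide
print(estimate_processing_minutes(350))  # 35
```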
Using Knowledge Search
Basic Search
Select Corpus
Ask Question
- “Find contracts with termination for convenience clauses”
- “What are the indemnification terms in customer contracts?”
- “Show me all contracts expiring in Q1 2025”
Review Results
- Answer: AI-generated response synthesizing relevant information
- Sources: Links to specific documents with relevance scores
- Excerpts: Relevant passages highlighted
Refine Search
- “Which of these have the shortest notice period?”
- “Compare the payment terms across these contracts”
- “Show me the specific contract language”
Advanced Search Techniques
Multi-Corpus Search
- @Contracts Demo
- @Vendor Agreements
- @MSAs (Master Service Agreements)
Field Filtering
Date Range Queries
- “Policies updated in the last 6 months”
- “Guides created this year”
- “Contracts expiring next quarter”
Comparison Queries
- “How does our current PTO policy differ from the 2020 version?”
- “Compare indemnification clauses across top 5 contracts”
Extraction Queries
Managing Corpora
Corpus Dashboard
View all your corpora and manage embedded files:
Corpus List
- All corpora you own or have access to
- Document count per corpus
- Last sync timestamp
- Storage size
Search Analytics
- Most searched corpora
- Common queries
- User adoption metrics
- Performance stats
Embedded Files View
- See all documents in corpus
- Individual file status
- Add or remove specific files
- Re-embed modified documents
Sharing Settings
- Who has access
- Permission levels
- Share with teams or individuals
Corpus Maintenance
- Adding Documents
- Removing Documents
- Re-Embedding
- Corpus Settings
Adding documents automatically (Auto-Sync on):
- Add documents to the SharePoint library
- AVA automatically detects and embeds them
- Typically within 1 hour
Adding documents manually:
- Navigate to corpus settings
- Click “Add Files” button
- Select files from SharePoint
- Click “Embed Files”
- Choose specific documents to add
- Useful for selective corpus building
How Knowledge Search Works (Technical)
The RAG Pipeline
Document Ingestion
- AVA connects to SharePoint using your delegated permissions
- Downloads documents you have access to
- Extracts text content from each file
- Chunks documents into manageable segments (~500 tokens each)
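To illustrate the chunking step, here is a minimal sketch that splits extracted text into segments of at most 500 tokens. It approximates tokens with whitespace-separated words, which is an assumption: the real tokenizer and AVA's exact chunking rules are not described in this guide, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()  # crude stand-in for a real tokenizer
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A 1,200-word document yields chunks of 500, 500, and 200 words
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(document)
for position, chunk in enumerate(chunks):
    print(position, len(chunk.split()))
```

The chunk position printed here corresponds to the kind of per-chunk metadata that is stored alongside each vector.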
Embedding Generation
- Each chunk is sent to the embedding model (text-embedding-ada-002)
- The chunk is converted to a vector representation (1536 dimensions)
- The vector is stored in PostgreSQL with the pgvector extension
- Metadata stored: filename, page number, chunk position
Search Query Processing
- Your question is converted to a vector embedding
- pgvector performs a similarity search across all embedded chunks
- The top K most relevant chunks are retrieved (typically 5-10)
- Relevance scores are calculated
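The retrieval step can be pictured as ranking stored chunk vectors by their similarity to the question vector. The sketch below does this in plain Python with cosine similarity, which is an assumption: this guide does not say which distance metric is used, and in the product the search runs inside the database rather than in application code. The 3-dimensional vectors and chunk IDs are toy values standing in for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), chunk_id) for chunk_id, vec in chunks.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy embeddings keyed by "document#chunk position"
chunks = {
    "remote-work-policy#0": [0.9, 0.1, 0.0],
    "pto-policy#2": [0.1, 0.9, 0.0],
    "vpn-guide#1": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, chunks, k=2)
print([chunk_id for chunk_id, _ in results])  # ['remote-work-policy#0', 'pto-policy#2']
```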
Response Generation
- Model receives: your question + relevant document excerpts
- AI generates response based on actual content
- Citations added automatically
- Source documents linked
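One way to picture how the question and the retrieved excerpts reach the model is a single prompt with numbered sources, so that the answer can cite them. This is a hedged sketch only: the prompt wording, the `Excerpt` type, and `build_prompt` are all hypothetical and do not reflect AVA's actual prompt.

```python
from dataclasses import dataclass


@dataclass
class Excerpt:
    filename: str
    page: int
    text: str


def build_prompt(question: str, excerpts: list[Excerpt]) -> str:
    """Combine the question with numbered excerpts the model can cite."""
    lines = [
        "Answer the question using only the numbered excerpts below.",
        "Cite excerpts by number, for example [1].",
        "",
    ]
    for number, excerpt in enumerate(excerpts, start=1):
        lines.append(f"[{number}] {excerpt.filename}, page {excerpt.page}: {excerpt.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_prompt(
    "How many days per week can employees work from home?",
    [Excerpt("remote-work-policy.pdf", 2, "Employees can work from home up to 3 days per week.")],
)
print(prompt)
```

Numbering the excerpts keeps the mapping from citation to source document simple, since each number resolves back to a filename and page.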
Why This Works Better Than Keyword Search
| Keyword Search | Knowledge Search (RAG) |
|---|---|
| Exact word matches only | Understands meaning and intent |
| Misses synonyms and variations | Finds semantically similar content |
| No context understanding | Considers document context |
| Returns documents, not answers | Generates specific answers |
| Manual review of results | AI-synthesized responses |
| No source attribution | Automatic citations |
With keyword search:
- Might miss documents that say “remote work” or “telecommute”
- Returns a list of potentially relevant documents
- You read through each to find the answer
With Knowledge Search:
- Finds documents about remote work, telecommuting, work-from-home
- Returns: “According to the Remote Work Policy (updated Jan 2024), employees can work from home up to 3 days per week with manager approval…”
- Includes a link to the exact policy document
Use Cases by Department
Legal Department
Contract Management
- “Find all contracts with limitation of liability caps under $1M”
- “Which contracts allow assignment to affiliates?”
- “Show me indemnification obligations in SaaS contracts”
- “Create table of all contract renewal dates in next 90 days”
Legal Research
- “Have we dealt with this issue before?”
- “Find similar disputes and their resolutions”
- “What was Legal’s opinion on X in the past?”
Compliance Documentation
- “What are GDPR requirements for data retention?”
- “Show me all data privacy policies”
- “What’s required for SOC2 compliance?”
HR Department
Policy Questions
Benefits Info
Onboarding
Procedures
IT/Support
- Troubleshooting
- How-To Guides
- System Documentation
- “How to fix login timeout errors?”
- “Steps to reset user password”
- “Troubleshoot VPN connection issues”
Sales & Marketing
Competitive Intelligence
Past Proposals
Product Documentation
Customer Case Studies
Best Practices
Organize by Purpose
Recommended:
- “Vendor Contracts” corpus
- “Customer Contracts” corpus
- “NDAs” corpus
Avoid:
- Single “All Contracts” corpus with everything
Write Clear Descriptions
Enable Auto-Sync for Active Libraries
Enable Auto-Sync for:
- Policy documents
- Active contracts
- Current product documentation
Disable Auto-Sync for:
- Historical/archived documents
- Reference libraries that don’t change
Use Field Extraction
- Document category/type
- Date (effective, expiration, creation)
- Company/customer name
- Department/owner
- Status (active, expired, draft)
Test with Questions
- Ask typical user questions
- Verify relevant documents are returned
- Check if answers are accurate
- Refine corpus if needed
Performance & Limits
Document Limits
- Max documents per corpus: 10,000
- Max file size: 50MB
- Supported types: PDF, Word, Excel, PowerPoint, TXT
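Before adding files, it can help to check them against these limits. The sketch below is illustrative only: the extension list is an assumed mapping of the supported types (for example, Word as `.docx`), 50MB is interpreted as 50 × 1024 × 1024 bytes, and `check_file` is a hypothetical helper rather than part of AVA.

```python
from pathlib import PurePath

MAX_DOCS_PER_CORPUS = 10_000
MAX_FILE_BYTES = 50 * 1024 * 1024  # assumes 1MB = 1024 * 1024 bytes
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".txt"}  # assumed mapping


def check_file(name: str, size_bytes: int, current_doc_count: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if PurePath(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {name}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file exceeds 50MB: {name}")
    if current_doc_count >= MAX_DOCS_PER_CORPUS:
        problems.append("corpus already holds the maximum of 10,000 documents")
    return problems


print(check_file("contract.pdf", 1_000_000, 350))  # []
print(check_file("demo.mp4", 10, 0))               # ['unsupported file type: demo.mp4']
```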
Search Performance
- Vector search: Less than 1 second
- AI response generation: 3-8 seconds
- Typical total: 5-10 seconds per query
Storage
- Vectors stored in PostgreSQL with pgvector
- Original files remain in SharePoint
- Corpus metadata and embeddings: ~1MB per 100 pages
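The "~1MB per 100 pages" figure gives a quick way to size a corpus. The sketch below applies it and, for comparison, computes the raw size of one 1536-dimension vector under the assumption of 4-byte floats. The function name and the 20-pages-per-contract average are hypothetical inputs chosen for illustration.

```python
def estimate_corpus_storage_mb(total_pages: int, mb_per_100_pages: float = 1.0) -> float:
    """Approximate storage for corpus metadata and embeddings, in MB."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return total_pages * mb_per_100_pages / 100


# 350 contracts at an assumed average of 20 pages each
print(estimate_corpus_storage_mb(350 * 20))  # 70.0

# Raw size of one embedding, assuming 4 bytes per dimension
raw_vector_kb = 1536 * 4 / 1024
print(raw_vector_kb)  # 6.0
```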
Concurrent Users
- No hard limit on concurrent searches
- Auto-scales with Azure Container Apps
- Performance degrades gracefully under load
Troubleshooting
- No Results
- Wrong Results
- Slow Processing
- Auto-Sync Issues
- Try rephrasing question
- Check if documents are actually embedded (view embedded files)
- Verify you have access to source SharePoint library
- Try broader search terms
- Check if corpus needs re-embedding (if documents updated)
