
Overview

Knowledge Search uses Retrieval Augmented Generation (RAG) to provide intelligent, semantic search across your SharePoint document libraries. Unlike traditional keyword search, Knowledge Search understands the meaning of your questions and finds relevant information even when exact keywords don’t match.

Semantic Search

Understands intent, not just keywords

Multi-Corpus Query

Search multiple document collections simultaneously

Source Attribution

Every answer includes citations to source documents

Auto-Sync

Automatically updates when documents change

What is a Corpus?

A corpus is a collection of documents from a SharePoint site or document library that has been:
  • Embedded (converted to vector representations)
  • Indexed for semantic search
  • Made searchable through AVA
Think of it as creating a “knowledge base” from your documents.

Example Corpora

Creating a Corpus

1

Navigate to Knowledge Search

Click “Knowledge Search” in AVA navigation
2

Click 'Create New Corpus'

Opens corpus creation dialog
3

Create New Corpus dialog showing SharePoint site selection and configuration options
4

Select SharePoint Site

  • Choose from dropdown of available sites
  • Must have read access to the site
  • Can preview site contents
5

Select Document Library

  • Choose specific library within site
  • See document count preview
  • Supported types: PDF, Word, Excel, PowerPoint, text
6

Add Description

Write a clear description, for example: “Contains all vendor contracts from 2020-present. Use for contract term searches, renewal dates, and clause analysis.”
Good descriptions help users know when to use this corpus.
7

Configure Auto-Sync

Auto-Sync ON: AVA periodically checks SharePoint for:
  • New documents (added automatically)
  • Modified documents (re-embedded)
  • Deleted documents (removed from corpus)
Auto-Sync OFF: Corpus is static, manual updates only
Recommendation: ON for active document libraries
Auto-sync toggle with tooltip explaining automatic document updates
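The three auto-sync cases above (new, modified, deleted) amount to a comparison between what the corpus has already indexed and what the library currently contains. The following is a minimal sketch of that comparison, not AVA's actual implementation: `SyncPlan`, `plan_sync`, and the `{path: version_tag}` snapshot format are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    """Hypothetical result of one sync pass."""
    to_embed: list[str]    # new documents
    to_reembed: list[str]  # modified documents
    to_remove: list[str]   # deleted documents


def plan_sync(indexed: dict[str, str], library: dict[str, str]) -> SyncPlan:
    """Compare two {path: version_tag} snapshots and classify the changes."""
    to_embed = sorted(p for p in library if p not in indexed)
    to_reembed = sorted(
        p for p in library if p in indexed and library[p] != indexed[p]
    )
    to_remove = sorted(p for p in indexed if p not in library)
    return SyncPlan(to_embed, to_reembed, to_remove)


# Snapshot of what the corpus holds vs. what the library holds now.
indexed = {"msa.pdf": "v1", "nda.docx": "v1", "old.pdf": "v3"}
library = {"msa.pdf": "v2", "nda.docx": "v1", "sow.pdf": "v1"}

plan = plan_sync(indexed, library)
print(plan.to_embed)    # ['sow.pdf']
print(plan.to_reembed)  # ['msa.pdf']
print(plan.to_remove)   # ['old.pdf']
```

Unchanged documents (here `nda.docx`) appear in none of the three lists, so a sync pass over a quiet library does no embedding work.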
8

Set Field Extraction (Optional)

Extract custom fields from documents for enhanced filtering.
Common fields:
  • document_type: Contract, Policy, Guide
  • department: Sales, Legal, Engineering
  • date: Effective date, creation date
  • author: Document creator
  • company: Customer or vendor name
  • keywords: Custom tags
Note: Field extraction cannot be changed after creation.
9

Create & Process

  • Click “Create Corpus + Add Files”
  • AVA begins embedding documents
  • Processing time: ~1 minute per 10 documents
  • You can leave and return later
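As a rough planning aid, the processing rate quoted above (~1 minute per 10 documents) can be turned into an estimate. This is a back-of-the-envelope sketch: `estimate_processing_minutes` is a hypothetical helper, and actual times will vary with document size and type.

```python
import math


def estimate_processing_minutes(document_count: int,
                                docs_per_minute: float = 10.0) -> int:
    """Estimate embedding time from the documented ~10 documents per minute."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    return math.ceil(document_count / docs_per_minute)


print(estimate_processing_minutes(25))      # 3
print(estimate_processing_minutes(2_500))   # 250
print(estimate_processing_minutes(10_000))  # 1000
```

At that rate, a corpus at the 10,000-document limit would take roughly 1,000 minutes (about 17 hours), which is why being able to leave and return later matters.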
1

Select Corpus

Click the “@ Connected Data” button and choose a corpus from the dropdown.
Example: Select “@Contracts Demo”
Tip: You can select multiple corpora to search simultaneously
2

Ask Question

Type your question in natural language.
Examples:
  • “Find contracts with termination for convenience clauses”
  • “What are the indemnification terms in customer contracts?”
  • “Show me all contracts expiring in Q1 2025”
3

Review Results

AVA returns:
  • Answer: AI-generated response synthesizing relevant information
  • Sources: Links to specific documents with relevance scores
  • Excerpts: Relevant passages highlighted
Each source is clickable to open full document
4

Refine Search

Ask follow-up questions:
  • “Which of these have the shortest notice period?”
  • “Compare the payment terms across these contracts”
  • “Show me the specific contract language”

Advanced Search Techniques

If corpus has field extraction configured:
Query: “Find contracts where document_type=‘MSA’ and company contains ‘Tech’”
AVA filters documents matching criteria before searching content.
Use Case: Narrow search to specific document categories
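The field-filtered query above can be pictured as a metadata pre-filter that runs before any content search. The sketch below is illustrative only: the `matches` helper, the document records, and the case-insensitive "contains" behavior are assumptions, not AVA's documented behavior.

```python
def matches(fields: dict[str, str],
            equals: dict[str, str],
            contains: dict[str, str]) -> bool:
    """Return True when every equality and substring condition holds."""
    for name, wanted in equals.items():
        if fields.get(name) != wanted:
            return False
    for name, fragment in contains.items():
        # Assumption: "contains" is treated as case-insensitive.
        if fragment.lower() not in fields.get(name, "").lower():
            return False
    return True


# Hypothetical extracted fields for three documents.
documents = [
    {"file": "techcorp-msa.pdf", "document_type": "MSA", "company": "TechCorp"},
    {"file": "techcorp-nda.pdf", "document_type": "NDA", "company": "TechCorp"},
    {"file": "acme-msa.pdf", "document_type": "MSA", "company": "Acme Inc"},
]

# document_type = 'MSA' and company contains 'Tech'
candidates = [
    d["file"]
    for d in documents
    if matches(d, equals={"document_type": "MSA"}, contains={"company": "Tech"})
]
print(candidates)  # ['techcorp-msa.pdf']
```

Only the surviving candidates would then go through semantic search, which is what keeps results limited to the requested document category.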
Query: “Show contracts signed in 2024”
AVA understands temporal queries and filters accordingly.
More examples:
  • “Policies updated in the last 6 months”
  • “Guides created this year”
  • “Contracts expiring next quarter”
Query: “Compare payment terms in Microsoft contract vs Salesforce contract”
AVA retrieves both documents and provides a side-by-side comparison.
More examples:
  • “How does our current PTO policy differ from the 2020 version?”
  • “Compare indemnification clauses across top 5 contracts”
Query: “Create a table of all contracts with columns: Company, Value, Expiration Date, Auto-Renewal”
AVA extracts structured data from multiple documents.
Export: Can export table to Excel
Use Case: Data extraction and analysis

Managing Corpora

Corpus Dashboard

View all your corpora and manage embedded files:
Knowledge Search showing embedded files from Contracts Demo with file details, descriptions, and pagination

Corpus List

  • All corpora you own or have access to
  • Document count per corpus
  • Last sync timestamp
  • Storage size

Search Analytics

  • Most searched corpora
  • Common queries
  • User adoption metrics
  • Performance stats

Embedded Files View

  • See all documents in corpus
  • Individual file status
  • Add or remove specific files
  • Re-embed modified documents

Sharing Settings

  • Who has access
  • Permission levels
  • Share with teams or individuals

Corpus Maintenance

  • Adding Documents
  • Removing Documents
  • Re-Embedding
  • Corpus Settings
With Auto-Sync ON:
  • Add documents to SharePoint library
  • AVA automatically detects and embeds
  • Typically within 1 hour
With Auto-Sync OFF:
  1. Navigate to corpus settings
  2. Click “Add Files” button
  3. Select files from SharePoint
  4. Click “Embed Files”
Manual Selection:
  • Choose specific documents to add
  • Useful for selective corpus building

How Knowledge Search Works (Technical)

The RAG Pipeline

1

Document Ingestion

When you create a corpus:
  1. AVA connects to SharePoint using your delegated permissions
  2. Downloads documents you have access to
  3. Extracts text content from each file
  4. Chunks documents into manageable segments (~500 tokens each)
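The chunking step above can be sketched as follows. This is a simplified illustration, assuming whitespace-separated words as a stand-in for tokens and no overlap between chunks. A real pipeline would count tokens with the embedding model's tokenizer. `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into consecutive segments of at most max_tokens words.

    Words approximate tokens here; model tokenizers count differently.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A synthetic 1,200-word document.
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(sample)
print([len(chunk.split()) for chunk in chunks])  # [500, 500, 200]
```

The last chunk is simply whatever remains, so chunk sizes are "about 500" rather than exactly 500.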
2

Embedding Generation

For each chunk:
  1. Sent to embedding model (text-embedding-ada-002)
  2. Converted to vector representation (1536 dimensions)
  3. Vector stored in PostgreSQL with pgVector extension
  4. Metadata stored: filename, page number, chunk position
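To make the storage step concrete, here is a sketch of what one stored chunk might look like. The table name, column names, and `ChunkRecord` type are hypothetical; only the 1536-dimension vector, the pgvector `vector` column type, and the listed metadata fields come from the steps above. The code builds SQL and parameters but does not connect to a database, and the `%s` placeholders assume a psycopg-style driver.

```python
from dataclasses import dataclass

EMBEDDING_DIM = 1536  # output size of text-embedding-ada-002

# Hypothetical schema; the real AVA schema is not documented here.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id             bigserial PRIMARY KEY,
    filename       text    NOT NULL,
    page_number    integer,
    chunk_position integer NOT NULL,
    content        text    NOT NULL,
    embedding      vector(1536) NOT NULL
);
"""

INSERT_SQL = (
    "INSERT INTO chunks "
    "(filename, page_number, chunk_position, content, embedding) "
    "VALUES (%s, %s, %s, %s, %s::vector)"
)


@dataclass(frozen=True)
class ChunkRecord:
    filename: str
    page_number: int
    chunk_position: int
    content: str
    embedding: list[float]


def to_vector_literal(embedding: list[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2,...]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_params(record: ChunkRecord) -> tuple:
    """Build the parameter tuple matching INSERT_SQL."""
    return (
        record.filename,
        record.page_number,
        record.chunk_position,
        record.content,
        to_vector_literal(record.embedding),
    )


record = ChunkRecord("Remote Work Policy.pdf", 2, 0, "Employees may...",
                     [0.0] * EMBEDDING_DIM)
params = insert_params(record)
print(params[:3])  # ('Remote Work Policy.pdf', 2, 0)
```

Keeping filename, page number, and chunk position next to each vector is what later lets a search result point back to a specific place in a source document.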
3

Search Query Processing

When you search:
  1. Your question is converted to vector embedding
  2. pgVector performs similarity search across all embedded chunks
  3. Top K most relevant chunks retrieved (typically 5-10)
  4. Relevance scores calculated
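The retrieval step above can be illustrated in plain Python. This sketch assumes cosine similarity as the relevance score, which is one of the distance measures pgvector supports. The source does not state which measure AVA uses. The 3-dimensional vectors and chunk ids are toy values standing in for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float],
          chunks: dict[str, list[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vector))
              for chunk_id, vector in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones would have 1536 dimensions.
chunks = {
    "remote-work-policy#2": [0.9, 0.1, 0.0],
    "expense-policy#1": [0.0, 1.0, 0.0],
    "telecommute-faq#0": [0.7, 0.0, 0.7],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, chunks, k=2)
print([chunk_id for chunk_id, _ in results])
# ['remote-work-policy#2', 'telecommute-faq#0']
```

In a database-backed setup the same ranking would be done by the vector index rather than a Python loop, but the idea of scoring every candidate against the query vector and keeping the top K is the same.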
4

Response Generation

Retrieved chunks sent to AI model as context:
  1. Model receives: your question + relevant document excerpts
  2. AI generates response based on actual content
  3. Citations added automatically
  4. Source documents linked
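The response-generation step above can be sketched as prompt assembly: the question and the retrieved excerpts are combined into one input, with each excerpt numbered so the model can cite it. The prompt wording, `RetrievedChunk`, and `build_prompt` are hypothetical and shown only to illustrate the idea. They are not AVA's actual prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    filename: str
    page_number: int
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Combine the question with numbered source excerpts for citation."""
    sources = "\n\n".join(
        f"[{number}] {chunk.filename} (page {chunk.page_number})\n{chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


retrieved = [
    RetrievedChunk("Remote Work Policy.pdf", 2,
                   "Employees can work from home up to 3 days per week."),
    RetrievedChunk("Employee Handbook.pdf", 14,
                   "Remote work requires manager approval."),
]
prompt = build_prompt("What's our policy on working from home?", retrieved)
print(prompt)
```

Because each excerpt carries its filename and page number, the bracketed numbers in the model's answer can be mapped back to links to the source documents.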
Keyword Search | Knowledge Search (RAG)
Exact word matches only | Understands meaning and intent
Misses synonyms and variations | Finds semantically similar content
No context understanding | Considers document context
Returns documents, not answers | Generates specific answers
Manual review of results | AI-synthesized responses
No source attribution | Automatic citations
Example:
Query: “What’s our policy on working from home?”
Keyword Search:
  • Might miss documents that say “remote work” or “telecommute”
  • Returns list of potentially relevant documents
  • You read through each to find answer
Knowledge Search:
  • Finds documents about remote work, telecommuting, work-from-home
  • Returns: “According to the Remote Work Policy (updated Jan 2024), employees can work from home up to 3 days per week with manager approval…”
  • Includes link to exact policy document

Use Cases by Department

Corpus: All contracts (vendor, customer, partner)
Common Searches:
  • “Find all contracts with limitation of liability caps under $1M”
  • “Which contracts allow assignment to affiliates?”
  • “Show me indemnification obligations in SaaS contracts”
  • “Create table of all contract renewal dates in next 90 days”
Time Savings: Days of manual review → Minutes
Corpus: Regulatory documents, compliance policies
Common Searches:
  • “What are GDPR requirements for data retention?”
  • “Show me all data privacy policies”
  • “What’s required for SOC2 compliance?”
Benefit: Ensure compliance with current regulations

HR Department

Policy Questions

Corpus: Employee handbook, HR policies
Use: Answer employee questions instantly
  • “What’s the parental leave policy?”
  • “How do I request PTO?”

Benefits Info

Corpus: Benefits guides, provider documents
Use: Help employees understand benefits
  • “What dental plans are available?”
  • “How does HSA work?”

Onboarding

Corpus: Onboarding materials, training docs
Use: New hire questions
  • “What systems do I need access to?”
  • “What’s the dress code?”

Procedures

Corpus: HR process documentation
Use: HR team reference
  • “How do I process a termination?”
  • “What’s the promotion approval process?”

IT/Support

  • Troubleshooting
  • How-To Guides
  • System Documentation
Corpus: Technical support documentation
Searches:
  • “How to fix login timeout errors?”
  • “Steps to reset user password”
  • “Troubleshoot VPN connection issues”
Benefit: Faster ticket resolution

Sales & Marketing

1

Competitive Intelligence

Corpus: Competitive research, battle cards
Use: “How do we compare to Competitor X on feature Y?”
2

Past Proposals

Corpus: Winning proposals and case studies
Use: “Find proposals for healthcare industry customers”
3

Product Documentation

Corpus: Product specs, feature descriptions
Use: “What are the key features of Product X?”
4

Customer Case Studies

Corpus: Success stories, testimonials
Use: “Find case studies showing ROI > 200%”

Best Practices

Create separate corpora for different use cases:
✅ Good:
  • “Vendor Contracts” corpus
  • “Customer Contracts” corpus
  • “NDAs” corpus
❌ Avoid:
  • Single “All Contracts” corpus with everything
Why: More focused search, better results
Help users understand when to use each corpus:
✅ Good: “Contains all active vendor contracts from 2022-present. Use for: payment terms, renewal dates, SLA requirements. Auto-synced daily.”
❌ Avoid: “Vendor stuff”
Why: Users find the right corpus faster
Turn ON auto-sync for document libraries that change frequently:
  • Policy documents
  • Active contracts
  • Current product documentation
Turn OFF auto-sync for:
  • Historical/archived documents
  • Reference libraries that don’t change
Why: Keep corpus current without manual work
Plan field extraction before creating a corpus, because fields cannot be changed afterward.
Useful fields:
  • Document category/type
  • Date (effective, expiration, creation)
  • Company/customer name
  • Department/owner
  • Status (active, expired, draft)
Why: Enables filtered searches and better organization
After creating corpus, test with common questions:
  1. Ask typical user questions
  2. Verify relevant documents are returned
  3. Check if answers are accurate
  4. Refine corpus if needed
Why: Ensure corpus meets user needs
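The testing steps above can be turned into a small repeatable check: keep a list of typical questions with the document each one should surface, and confirm it shows up in the results. The sketch below uses a stand-in `fake_search` function with canned results. How results are actually obtained from AVA is not covered here, and `check_corpus` is a hypothetical helper.

```python
from typing import Callable


def check_corpus(search: Callable[[str], list[str]],
                 expected: dict[str, str]) -> dict[str, bool]:
    """For each question, report whether the expected document was returned."""
    return {
        question: expected_doc in search(question)
        for question, expected_doc in expected.items()
    }


def fake_search(question: str) -> list[str]:
    """Stand-in for a real search call; returns canned source filenames."""
    canned = {
        "What's the parental leave policy?": [
            "Employee Handbook.pdf",
            "Leave Policy.docx",
        ],
    }
    return canned.get(question, [])


expected = {
    "What's the parental leave policy?": "Leave Policy.docx",
    "How do I request PTO?": "PTO Procedure.docx",
}

report = check_corpus(fake_search, expected)
for question, found in report.items():
    print("OK  " if found else "MISS", question)
```

A "MISS" points at a corpus to refine, for example by adding the missing document or improving the corpus description so users pick the right one.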

Performance & Limits

Document Limits

  • Max documents per corpus: 10,000
  • Max file size: 50MB
  • Supported types: PDF, Word, Excel, PowerPoint, TXT

Search Performance

  • Vector search: Less than 1 second
  • AI response generation: 3-8 seconds
  • Typical total: 5-10 seconds per query

Storage

  • Vectors stored in PostgreSQL with pgVector
  • Original files remain in SharePoint
  • Corpus metadata and embeddings: ~1MB per 100 pages

Concurrent Users

  • No hard limit on concurrent searches
  • Auto-scales with Azure Container Apps
  • Performance degrades gracefully under load

Troubleshooting

  • No Results
  • Wrong Results
  • Slow Processing
  • Auto-Sync Issues
Problem: Search returns no relevant results
Solutions:
  1. Try rephrasing question
  2. Check if documents are actually embedded (view embedded files)
  3. Verify you have access to source SharePoint library
  4. Try broader search terms
  5. Check if corpus needs re-embedding (if documents updated)

Next Steps