Building AXAM: A Journey from Concept to Reality - Optimizing AI for Offline Use Pt2
How one graduate student spent weeks fine-tuning a RAG system to bring MIT-level education to students in resource-constrained regions—and the surprising lessons learned along the way.
The Challenge: Bringing World-Class Education Where the Internet Doesn't Reach
Picture this: You're a high school student in rural Uganda. The nearest university is hours away, internet connectivity is sparse at best, and data costs more than your family can afford. Yet somewhere on the internet, MIT has published thousands of hours of world-class lectures covering everything from calculus to computer science—completely free.
The problem? You can't access them.
This is the gap that Emmanuel, a graduate student at Yeshiva University's Katz School, set out to bridge with AXAM—an AI-powered educational platform designed to work entirely offline. Think of it as having a knowledgeable teaching assistant in your pocket, one that can answer questions about complex academic topics without needing an internet connection.
But building such a system turned out to be far more complex than simply downloading some AI models and calling it a day. What followed was a weeks-long journey of optimization, testing, and hard-won insights that transformed a promising concept into a practical solution.
The Foundation: Understanding RAG (Retrieval-Augmented Generation)
Before diving into the technical journey, let's understand what we're building. Traditional AI chatbots are like students who crammed for an exam months ago—they have general knowledge but can't reference specific materials when needed. RAG systems, on the other hand, are like students taking an open-book exam. They can:
- Search through a library of documents to find relevant information
- Retrieve the most relevant passages
- Generate answers based on that specific context
For AXAM, this meant taking 7,798 MIT OpenCourseWare video transcripts, converting them into searchable chunks, and building a system that could intelligently find and explain relevant content when students ask questions.
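To make that open-book flow concrete, here is a minimal sketch in Python, assuming the stack listed at the end of this post (Ollama serving EmbeddingGemma and llama3.2, ChromaDB as the vector store). The model tags, collection name, and database path are illustrative assumptions, not AXAM's actual code.

```python
import chromadb
import requests

OLLAMA = "http://localhost:11434"

def ask(question: str, n_chunks: int = 3) -> str:
    """Retrieve relevant transcript chunks, then generate an answer from them."""
    # 1. Embed the question with the embedding model served by Ollama.
    emb = requests.post(f"{OLLAMA}/api/embeddings",
                        json={"model": "embeddinggemma", "prompt": question},
                        timeout=60).json()["embedding"]

    # 2. Retrieve the closest chunks from the local vector database.
    collection = chromadb.PersistentClient(path="axam_db").get_collection("axam")
    hits = collection.query(query_embeddings=[emb], n_results=n_chunks)
    context = "\n\n".join(hits["documents"][0])

    # 3. Generate an answer grounded in that retrieved context.
    reply = requests.post(f"{OLLAMA}/api/generate",
                          json={"model": "llama3.2", "stream": False,
                                "prompt": f"Answer using only this context:\n{context}\n\n"
                                          f"Question: {question}"},
                          timeout=300).json()
    return reply["response"]
```

Everything that follows in this post is, in one way or another, about making those three steps faster and more accurate on modest hardware.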
Decision Point #1: Choosing the Right Embedding Model
The first major challenge? Deciding how to convert text into numbers that computers can understand—a process called "embedding."
The Starting Point
The initial system used a model called all-MiniLM-L6-v2. It was small (90MB), fast, and got the job done. But "good enough" wasn't the goal when the target audience consists of students who might get one shot at understanding a difficult concept.
The Research Phase
After extensive research into 2025's latest embedding models, several candidates emerged:
- EmbeddingGemma-300M: Google's newest model, 308M parameters, 65.2 MTEB score
- Qwen3-Embedding: 600M parameters, slightly better scores but 4x larger
- BGE-M3 and multilingual-e5-large: Excellent quality but 2.2GB each—too large for resource-constrained deployments
The Aha Moment
The breakthrough came when analyzing what "accuracy" really means in practice. With a corpus of nearly 8,000 documents creating over 117,000 searchable chunks, even a 9-point improvement in MTEB scores (from 56.1 to 65.2) translates to approximately 23% more students getting correct answers to their queries.
Think of it this way: If 100 students ask "How does photosynthesis work?", the old model might correctly direct 56 of them to the right explanation. The new model? 65 students. Those extra nine percentage points might not sound revolutionary in a research paper, but they're nine more students who understand biology.
The Decision: EmbeddingGemma-300M won because it offered the best balance:
- 10x smaller than top competitors (227MB vs 2.2GB when quantized)
- Only 1-3 points lower MTEB score than much larger models
- 4x longer context window than alternatives (2048 vs 512 tokens)
- Support for 100+ languages including Swahili, French, and local East African languages
- Mobile-optimized by Google for exactly this type of deployment
Challenge #2: The VTT Subtitle Problem
With the embedding model selected, the next step was preparing the data. This is where an unexpected problem emerged: the MIT transcripts were polluted with VTT (WebVTT) subtitle formatting.
What Was Wrong
VTT files don't just contain what people said—they contain timing information, positioning data, and crucially, progressive repetition. Here's what that looked like:
00:00:01.000 --> 00:00:02.000
hi everyone
00:00:02.000 --> 00:00:03.500
hi everyone welcome
00:00:03.500 --> 00:00:05.000
hi everyone welcome back
The same words repeated three times, inflating file sizes and creating noise in the data.
The Solution
A smart cleaning function was developed that could distinguish between VTT's progressive display (which should be removed) and natural repetition like "thank you, thank you" (which should be kept). The results were dramatic:
Before cleaning: 202,820 characters
After cleaning: 43,963 characters
Reduction: 78.3%
This wasn't just about storage—cleaner data means more relevant content in each chunk, leading to better retrievals and more accurate answers.
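As an illustration of the approach (AXAM's actual cleaning function isn't reproduced in this post), a prefix-based cleaner might look like this:

```python
import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} -->")

def clean_vtt(raw: str) -> str:
    """Strip WebVTT timing lines and collapse progressive-caption repetition."""
    # Keep only caption text (drop the WEBVTT header, timestamps, and blank lines).
    cues = [line.strip() for line in raw.splitlines()
            if line.strip() and line.strip() != "WEBVTT"
            and not TIMESTAMP.match(line.strip())]

    kept = []
    for i, cue in enumerate(cues):
        # Progressive display: "hi everyone" -> "hi everyone welcome" -> ...
        # Skip a cue when the next cue merely extends it, keeping only the
        # longest version. Repetition inside a cue ("thank you, thank you")
        # is untouched, and exact duplicate cues are kept as-is.
        if i + 1 < len(cues) and cues[i + 1] != cue and cues[i + 1].startswith(cue):
            continue
        kept.append(cue)
    return " ".join(kept)

sample = """WEBVTT

00:00:01.000 --> 00:00:02.000
hi everyone

00:00:02.000 --> 00:00:03.500
hi everyone welcome

00:00:03.500 --> 00:00:05.000
hi everyone welcome back
"""
print(clean_vtt(sample))  # -> "hi everyone welcome back"
```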
Decision Point #3: How to Chunk the Content
Here's a problem that doesn't have an obvious answer: When you have a 50-page transcript, how do you break it into pieces that AI can work with?
The Challenge
- Too small (100-200 words): Fragments lose context, like reading random sentences from a textbook
- Too large (2000+ words): The AI gets overwhelmed with information and loses focus
- Just right: Captures complete thoughts while remaining digestible
The Solution
After testing, the team settled on chunks of approximately 1,500 tokens (roughly 6,000 characters, or a few minutes of spoken lecture). This size:
- Captured complete explanations of concepts
- Fit comfortably within the embedding model's 2,048 token context window
- Provided enough context without overwhelming the retrieval system
An overlap of 300 tokens between chunks ensured that concepts spanning chunk boundaries wouldn't be lost—like having the last sentence of one page repeat at the top of the next.
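A character-based approximation of that scheme is sketched below, treating a token as roughly four characters; the real splitter in ingest.py may count tokens properly, so this is illustrative only.

```python
def chunk_text(text: str, chunk_tokens: int = 1500, overlap_tokens: int = 300,
               chars_per_token: int = 4) -> list[str]:
    """Split text into overlapping windows, approximating tokens as ~4 characters."""
    size = chunk_tokens * chars_per_token                     # ~6,000 characters per chunk
    step = (chunk_tokens - overlap_tokens) * chars_per_token  # stride leaves a 300-token overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

# Example: a 20,000-character transcript yields 4 overlapping chunks of up to
# 6,000 characters, each sharing ~1,200 characters with its neighbour.
```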
The Ollama Headache: When Technology Doesn't Cooperate
Not everything went smoothly. After successfully setting up the new embedding model and cleaning the data, the system... crashed.
The Error
500 Internal Server Error: this model does not support embeddings
The Investigation
The problem? Ollama version 0.9.2 was too old to properly support the new embedding model. What followed was a classic tech troubleshooting session:
- Verify the model was actually downloaded (✓)
- Test with curl commands (✗ - model claimed it couldn't embed)
- Check version numbers (0.9.2 vs required 0.4.0+... wait, that doesn't make sense)
- Realize version numbering had changed
- Upgrade to 0.13.1
- Success!
Lesson learned: Always check compatibility before assuming your code is wrong. Sometimes the tools themselves need updating.
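For reference, the kind of embedding sanity check from step 2 can also be run from Python against Ollama's standard /api/embeddings route; the model tag here is an assumption (check `ollama list` for the exact name on your machine).

```python
import requests

# If the Ollama server is too old for the model, this call is where the
# "500 Internal Server Error: this model does not support embeddings" shows up.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "embeddinggemma", "prompt": "What is conditional expectation?"},
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embedding"]))  # prints the embedding dimensionality on success
```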
The Moment of Truth: Does It Actually Work?
With everything configured, cleaned, and updated, it was time to test whether the system could actually retrieve relevant information.
Test Query
"What is conditional expectation?"
The Results
Result 1 (distance: 0.9666)
Source: Probability lecture
Text: "we will now go through an example which is essentially
a drill to consolidate our understanding of the conditional
expectation and the conditional variance..."
Perfect hit! The system found exactly the right lecture segment. But more importantly, look at the distance scores for results 2 and 3, both around 1.43. That clear separation (0.97 vs 1.43) showed the model could distinguish between highly relevant and somewhat relevant content—crucial for giving students the best possible answers.
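That distance gap is easy to inspect directly from the vector store. Here is a sketch along the lines of the earlier snippets (same assumed model tag, collection name, and database path):

```python
import chromadb
import requests

question = "What is conditional expectation?"
emb = requests.post("http://localhost:11434/api/embeddings",
                    json={"model": "embeddinggemma", "prompt": question},
                    timeout=60).json()["embedding"]

collection = chromadb.PersistentClient(path="axam_db").get_collection("axam")
hits = collection.query(query_embeddings=[emb], n_results=3)

# A big jump between the first and second distances (0.97 vs 1.43 here)
# signals a confident top hit rather than three equally plausible guesses.
for doc, dist in zip(hits["documents"][0], hits["distances"][0]):
    print(f"{dist:.4f}  {doc[:70]}...")
```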
The Speed Problem: When Good Enough Isn't Fast Enough
Early testing revealed a problem: The system worked beautifully but took 4 minutes and 45 seconds (285 seconds) to answer a single question. In a classroom of 40 students, that's... unacceptable.
The Investigation
The bottleneck was identified: retrieving 3 large chunks and using a verbose system prompt consumed too many tokens, leading to slow processing times on CPU-only hardware.
The Solution: A Three-Tier System
Lite Mode (5-10 seconds)
- Smallest model (gemma3:270m)
- 1 chunk retrieval
- Ultra-short prompt
- Best for: Quick facts, definitions
Simple Mode (15-30 seconds)
- Standard model (llama3.2)
- 1 chunk, concise prompt
- Best for: Standard questions
Complex Mode (30-60 seconds)
- Standard model (llama3.2)
- 2 chunks, detailed prompt
- Best for: Deep explanations
The system automatically routes questions to the appropriate mode based on complexity detection keywords.
The Surprise
Testing revealed something unexpected: "Complex" mode was often faster than "Simple" mode (9.8s vs 10.7s). The lesson? Always test your assumptions. Prompt length matters less than you'd think when the bottleneck is generation speed.
Final Verdict: Lite mode was dropped. The 0.3-second speed gain wasn't worth the 33% quality loss. The optimized two-tier system (simple/complex) became the standard.
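A minimal sketch of that keyword-based routing in the final two-tier setup follows; the keyword list and length threshold are invented for illustration and are not AXAM's actual rules.

```python
# Hypothetical complexity keywords; AXAM's real list may differ.
COMPLEX_KEYWORDS = {"explain", "why", "derive", "prove", "compare", "difference", "how"}

def route_query(question: str) -> dict:
    """Choose the generation mode from surface features of the question."""
    words = question.lower().replace("?", "").split()
    if COMPLEX_KEYWORDS & set(words) or len(words) > 15:
        return {"mode": "complex", "model": "llama3.2", "n_chunks": 2}
    return {"mode": "simple", "model": "llama3.2", "n_chunks": 1}

print(route_query("What is a derivative?"))
# -> {'mode': 'simple', 'model': 'llama3.2', 'n_chunks': 1}
print(route_query("Explain why conditional expectation is itself a random variable"))
# -> {'mode': 'complex', 'model': 'llama3.2', 'n_chunks': 2}
```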
Going Global: The Multilingual Challenge
Here's where AXAM's mission truly came to life. East Africa is multilingual: English, French, Swahili, Luganda, Kinyarwanda, and dozens more languages. Students needed to ask questions in their preferred language.
The Technical Challenge
The embedding model (EmbeddingGemma) could understand 100+ languages, but the response-generation model (llama3.2) could only speak about 20-30 well. How do you bridge that gap?
The Solution: Language Detection + Cross-Lingual Retrieval
The system was enhanced to:
- Detect the question's language (using langdetect library)
- Retrieve relevant content even if it's in a different language (EmbeddingGemma's strength)
- Instruct the LLM to respond in the detected language
- Generate answers in the user's preferred language
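A rough sketch of the detection-plus-instruction step, using the langdetect library mentioned above; the language table and prompt wording are illustrative assumptions.

```python
from langdetect import detect  # pip install langdetect

LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French", "de": "German",
                  "pt": "Portuguese", "sw": "Swahili"}

def build_prompt(question: str, context: str) -> str:
    """Detect the question's language and instruct the LLM to answer in it.

    Retrieval itself is unchanged: EmbeddingGemma can match a Swahili question
    to English transcript chunks; only the generation step is told which
    language to respond in.
    """
    lang = LANGUAGE_NAMES.get(detect(question), "English")
    return (f"Answer the question using only the context below. "
            f"Respond in {lang}.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")
```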
Test Results
Five languages tested: English, Spanish, French, German, Portuguese
- Language detection: 100% accurate
- Cross-lingual retrieval: Working perfectly (Spanish question → English content → Spanish answer)
- Answer quality: Excellent across all languages
The Speed Problem (Again)
Initial multilingual tests showed 60-80 second response times—6-8x slower than expected. Why?
The culprit: An overly verbose system prompt (650 tokens). The prompt included:
- XML-style tags
- Repetitive instructions
- Excessive methodology lists
- Multiple reminders about language
The Optimization
The prompt was ruthlessly trimmed:
Before: 650 tokens
After: 90 tokens
Reduction: 86%
Expected improvement: 4-5x faster responses
Key insight: LLMs don't need verbose instructions. They tend to weight the beginning and end of a prompt most heavily, and everything in between is often wasted tokens.
The Cloud Temptation: Why Simpler Is Sometimes Better
At one point, the allure of Google Colab's free T4 GPU seemed irresistible. Imagine: 3-5x faster inference, free access, professional cloud infrastructure.
The Reality Check
Setting up cloud deployment would require:
- Uploading vector databases to Google Drive
- Installing Ollama on Colab (re-downloading 2GB of models per session)
- Setting up tunnels if using PC-based Ollama
- Managing session timeouts (12 hours max)
- Dealing with network latency
The Decision
Stay local. Here's why:
- Simplicity: Everything already works on the local machine
- Reliability: No session timeouts, no re-downloads
- Offline-first: The whole point of AXAM is working without internet
- Cost: Free locally vs. potential cloud costs for production
- Development speed: No upload/download delays
Sometimes the fancier solution isn't the better solution. The local setup was simpler, more reliable, and perfectly adequate for the use case.
Production Ready: Organizing for Deployment
The final phase involved structuring the code for real-world use. The notebook exploration phase had served its purpose—now it was time for production-quality code.
The Structure
Two key files emerged:
ingest.py (Data preparation)
- Load transcripts
- Clean VTT formatting
- Create chunks
- Generate embeddings
- Build vector database
answer.py (Query interface)
- Connect to vector database
- Detect language and complexity
- Retrieve relevant context
- Generate answers
- Support conversation history
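Pulling the earlier sketches together, the ingest.py flow described above looks roughly like this (it reuses the clean_vtt and chunk_text functions sketched earlier; the paths, model tag, and collection name remain assumptions):

```python
import pathlib

import chromadb
import requests

def embed(text: str) -> list[float]:
    """Embed one chunk via the local Ollama server."""
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "embeddinggemma", "prompt": text}, timeout=120)
    r.raise_for_status()
    return r.json()["embedding"]

def ingest(transcript_dir: str = "transcripts", db_path: str = "axam_db") -> None:
    """Load .vtt transcripts, clean, chunk, embed, and store them in ChromaDB."""
    collection = chromadb.PersistentClient(path=db_path).get_or_create_collection("axam")
    for path in pathlib.Path(transcript_dir).glob("*.vtt"):
        text = clean_vtt(path.read_text(encoding="utf-8"))  # from the VTT sketch above
        for i, chunk in enumerate(chunk_text(text)):        # from the chunking sketch above
            collection.add(ids=[f"{path.stem}-{i}"],
                           documents=[chunk],
                           embeddings=[embed(chunk)],
                           metadatas=[{"source": path.stem}])
```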
The History Feature
One crucial addition: conversation memory. Students don't ask isolated questions—they have dialogs. The system needed to remember context:
Student: "What is probability?"
AXAM: [explains probability]
Student: "How does it relate to expectation?"
AXAM: [uses previous context to explain the connection]
Student: "Can you give an example?"
AXAM: [provides example building on previous discussion]
This context awareness transforms AXAM from a search engine into a teaching assistant.
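A simple way to implement that memory is to keep a short window of recent turns and prepend them to each prompt. The window size and formatting below are guesses at the idea, not answer.py's actual code.

```python
from collections import deque

class Conversation:
    """Keep the last few question/answer pairs and fold them into each prompt."""

    def __init__(self, max_turns: int = 3):
        self.turns = deque(maxlen=max_turns)  # only the most recent exchanges survive

    def prompt(self, question: str, context: str) -> str:
        history = "\n".join(f"Student: {q}\nAXAM: {a}" for q, a in self.turns)
        return (f"{history}\n\nContext:\n{context}\n\n"
                f"Student: {question}\nAXAM:")

    def record(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
```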
Lessons Learned: What Really Matters
After weeks of development, testing, and optimization, several key insights emerged:
1. Quality Compounds at Scale
A 9-point MTEB improvement might seem small in isolation, but across 117,000 chunks and thousands of student queries, it translates to 23% more students getting correct answers. Small improvements in foundational components create massive downstream effects.
2. Data Cleaning Is Unglamorous But Crucial
Removing VTT formatting reduced file sizes by 78% and dramatically improved retrieval quality. The time spent on data preparation pays dividends every single query.
3. Prompt Engineering > Prompt Length
A 650-token verbose prompt performed no better than a 90-token focused one—but processed 4-5x slower. Clarity beats verbosity every time.
4. Test Your Assumptions
"Complex" mode being faster than "Simple" mode defied expectations. Always measure; don't assume.
5. Simpler Deployments Win
The cloud solution looked attractive but added complexity without clear benefits. For offline-first applications, local deployment makes sense.
6. Multilingual Isn't Optional
For global education tools, language support isn't a feature—it's a requirement. Students learn best in their native language.
7. Speed Matters More Than You Think
The difference between 60-second and 15-second responses isn't just convenience—it's the difference between 40 students getting help in a class period or only 10 getting help.
The Impact: What This Means for Students
Let's return to that student in rural Uganda. With AXAM running on a basic laptop or tablet:
- No internet required: The entire MIT OpenCourseWare library lives locally
- Ask in any language: Question in Swahili, Luganda, or English—all work
- Instant expertise: 15-second response times make learning interactive, not frustrating
- Quality answers: 23% improvement in retrieval accuracy means better learning outcomes
- Conversation memory: Build on previous questions like chatting with a tutor
The technical decisions—embedding models, chunk sizes, prompt optimization—all serve this singular goal: making world-class education accessible where it's needed most.
Looking Forward: What's Next for AXAM
The foundation is solid, but the journey continues:
Short-term priorities:
- Index the full 7,798-document corpus (currently tested with 10)
- Benchmark performance across different question types
- User testing with actual students in East Africa
- Fine-tune based on real-world feedback
Medium-term goals:
- Expand beyond MIT OCW to other open educational resources
- Add support for more local languages (Kinyarwanda, Luganda, etc.)
- Develop offline-friendly visualizations and diagrams
- Create teacher-facing dashboards for tracking student progress
Long-term vision:
- Partner with schools across East Africa for pilot programs
- Explore solar-powered hardware for maximum off-grid capability
- Build community contribution features where local teachers can add content
- Scale to other regions with similar connectivity challenges
Takeaways for Builders
If you're working on AI systems, particularly for education or resource-constrained environments, here's what matters:
Technical Decisions
- Choose embedding models based on deployment constraints, not just benchmarks
- Clean your data ruthlessly—garbage in, garbage out is real
- Optimize for your bottleneck—in CPU-only environments, that's generation speed
- Test multilingual support early if it's in your roadmap
- Measure everything—your intuitions about performance are probably wrong
Product Philosophy
- Simpler is usually better—fight the urge to over-engineer
- Design for your actual users, not the ideal user in your head
- Speed is a feature—optimize relentlessly
- Offline-first matters for billions of people globally
- Quality compounds—invest in foundational improvements
Development Process
- Prototype in notebooks, deploy in scripts
- Document your decisions—you'll forget why you made them
- Test assumptions constantly—what seems obvious often isn't
- Iterate based on real metrics, not feelings
- Sometimes good enough is good enough—ship and improve
The Bigger Picture
AXAM represents something larger than a technical project. It's a reminder that the best AI applications aren't always the most sophisticated—they're the ones that solve real problems for real people.
While tech companies race to build ever-larger models requiring ever-more computing power, there's a parallel challenge: How do we make powerful AI accessible to the billions of people without reliable internet, expensive hardware, or unlimited data plans?
The answer isn't one breakthrough—it's a thousand small optimizations. It's choosing a 227MB model instead of a 2.2GB one. It's cleaning VTT formatting to save 78% storage. It's trimming prompts from 650 to 90 tokens. It's choosing local deployment over cloud when it makes sense.
Each decision, each optimization, each test brings world-class education closer to students who need it most.
And that's worth far more than any benchmark score.
Closing Thoughts
Building AXAM has been a masterclass in practical AI engineering. Not the kind taught in research papers or online courses, but the kind learned through late nights debugging Ollama versions, discovering VTT formatting issues, and realizing that your "obviously faster" optimization is actually slower.
For Emmanuel and the AXAM project, the technical journey is just beginning. But the foundation is solid: a multilingual, offline-capable, optimized RAG system ready to bring MIT-level education to students across East Africa.
The next chapter? Putting it in students' hands and seeing what they can learn.
Because the measure of an AI system isn't how impressive it looks in a demo—it's how many lives it changes when the internet goes down.
About the Project: AXAM is being developed as part of the NSF I-Corps program, with deployment planned for schools in Uganda and Rwanda. The project combines MIT OpenCourseWare content with locally-optimized AI models to create an offline educational assistant.
Technical Stack: EmbeddingGemma-300M (embeddings), llama3.2 (generation), ChromaDB (vector store), Ollama (model serving), LangDetect (language detection), running on consumer-grade hardware optimized for offline operation.