Building AXAM: A Journey from Concept to Reality - Optimizing AI for Offline Use Pt2
How one graduate student spent weeks fine-tuning a RAG system to bring MIT-level education to students in resource-constrained regions—and the surprising lessons learned along the way.
The Challenge: Bringing World-Class Education Where the Internet Doesn't Reach
Picture this: You're a high school student in rural Uganda. The nearest university is hours away, internet connectivity is sparse at best, and data costs more than your family can afford. Yet somewhere on the internet, MIT has published thousands of hours of world-class lectures covering everything from calculus to computer science—completely free.
The problem? You can't access them.
This is the gap that Emmanuel, a graduate student at Yeshiva University's Katz School, set out to bridge with AXAM—an AI-powered educational platform designed to work entirely offline. Think of it as having a knowledgeable teaching assistant in your pocket, one that can answer questions about complex academic topics without needing an internet connection.
But building such a system turned out to be far more complex than simply downloading some AI models and calling it a day. What followed was a weeks-long journey of optimization, testing, and hard-won insights that transformed a promising concept into a practical solution.
The Foundation: Understanding RAG (Retrieval-Augmented Generation)
Before diving into the technical journey, let's understand what we're building. Traditional AI chatbots are like students who crammed for an exam months ago—they have general knowledge but can't reference specific materials when needed. RAG systems, on the other hand, are like students taking an open-book exam. They can:
- Search through a library of documents to find relevant information
- Retrieve the most relevant passages
- Generate answers based on that specific context
For AXAM, this meant taking 7,798 MIT OpenCourseWare video transcripts, converting them into searchable chunks, and building a system that could intelligently find and explain relevant content when students ask questions.
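To make that open-book flow concrete, here is a minimal sketch in Python, assuming the stack listed at the end of this post (Ollama serving EmbeddingGemma and llama3.2, ChromaDB as the vector store). The model tags, collection name, and database path are illustrative assumptions, not AXAM's actual code.

```python
import chromadb
import requests

OLLAMA = "http://localhost:11434"

def ask(question: str, n_chunks: int = 3) -> str:
    """Retrieve relevant transcript chunks, then generate an answer from them."""
    # 1. Embed the question with the embedding model served by Ollama.
    emb = requests.post(f"{OLLAMA}/api/embeddings",
                        json={"model": "embeddinggemma", "prompt": question},
                        timeout=60).json()["embedding"]

    # 2. Retrieve the closest chunks from the local vector database.
    collection = chromadb.PersistentClient(path="axam_db").get_collection("axam")
    hits = collection.query(query_embeddings=[emb], n_results=n_chunks)
    context = "\n\n".join(hits["documents"][0])

    # 3. Generate an answer grounded in that retrieved context.
    reply = requests.post(f"{OLLAMA}/api/generate",
                          json={"model": "llama3.2", "stream": False,
                                "prompt": f"Answer using only this context:\n{context}\n\n"
                                          f"Question: {question}"},
                          timeout=300).json()
    return reply["response"]
```

Everything that follows in this post is, in one way or another, about making those three steps faster and more accurate on modest hardware.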
Decision Point #1: Choosing the Right Embedding Model
The first major challenge? Deciding how to convert text into numbers that computers can understand—a process called "embedding."
The Starting Point
The initial system used a model called all-MiniLM-L6-v2. It was small (90MB), fast, and got the job done. But "good enough" wasn't the goal when the target audience consists of students who might get one shot at understanding a difficult concept.
The Research Phase
After extensive research into 2025's latest embedding models, several candidates emerged:
- EmbeddingGemma-300M: Google's newest model, 308M parameters, 65.2 MTEB score
- Qwen3-Embedding: 600M parameters, slightly better scores but 4x larger
- BGE-M3 and multilingual-e5-large: Excellent quality but 2.2GB each—too large for resource-constrained deployments
The Aha Moment
The breakthrough came when analyzing what "accuracy" really means in practice. With a corpus of nearly 8,000 documents creating over 117,000 searchable chunks, even a 9-point improvement in MTEB scores (from 56.1 to 65.2) translates to approximately 23% more students getting correct answers to their queries.
Think of it this way: If 100 students ask "How does photosynthesis work?", the old model might correctly direct 56 of them to the right explanation. The new model? 65 students. Those extra nine percentage points might not sound revolutionary in a research paper, but they're nine more students who understand biology.
The Decision: EmbeddingGemma-300M won because it offered the best balance:
- 10x smaller than top competitors (227MB vs 2.2GB when quantized)
- Only 1-3 points lower MTEB score than much larger models
- 4x longer context window than alternatives (2048 vs 512 tokens)
- Support for 100+ languages including Swahili, French, and local East African languages
- Mobile-optimized by Google for exactly this type of deployment
Challenge #2: The VTT Subtitle Problem
With the embedding model selected, the next step was preparing the data. This is where an unexpected problem emerged: the MIT transcripts were polluted with VTT (WebVTT) subtitle formatting.
What Was Wrong
VTT files don't just contain what people said—they contain timing information, positioning data, and crucially, progressive repetition. Here's what that looked like:
00:00:01.000 --> 00:00:02.000
hi everyone
00:00:02.000 --> 00:00:03.500
hi everyone welcome
00:00:03.500 --> 00:00:05.000
hi everyone welcome back
The same words repeated three times, inflating file sizes and creating noise in the data.
The Solution
A smart cleaning function was developed that could distinguish between VTT's progressive display (which should be removed) and natural repetition like "thank you, thank you" (which should be kept). The results were dramatic:
Before cleaning: 202,820 characters
After cleaning: 43,963 characters
Reduction: 78.3%
This wasn't just about storage—cleaner data means more relevant content in each chunk, leading to better retrievals and more accurate answers.
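As an illustration of the approach (AXAM's actual cleaning function isn't reproduced in this post), a prefix-based cleaner might look like this:

```python
import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} -->")

def clean_vtt(raw: str) -> str:
    """Strip WebVTT timing lines and collapse progressive-caption repetition."""
    # Keep only caption text (drop the WEBVTT header, timestamps, and blank lines).
    cues = [line.strip() for line in raw.splitlines()
            if line.strip() and line.strip() != "WEBVTT"
            and not TIMESTAMP.match(line.strip())]

    kept = []
    for i, cue in enumerate(cues):
        # Progressive display: "hi everyone" -> "hi everyone welcome" -> ...
        # Skip a cue when the next cue merely extends it, keeping only the
        # longest version. Repetition inside a cue ("thank you, thank you")
        # is untouched, and exact duplicate cues are kept as-is.
        if i + 1 < len(cues) and cues[i + 1] != cue and cues[i + 1].startswith(cue):
            continue
        kept.append(cue)
    return " ".join(kept)

sample = """WEBVTT

00:00:01.000 --> 00:00:02.000
hi everyone

00:00:02.000 --> 00:00:03.500
hi everyone welcome

00:00:03.500 --> 00:00:05.000
hi everyone welcome back
"""
print(clean_vtt(sample))  # -> "hi everyone welcome back"
```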
Decision Point #3: How to Chunk the Content
Here's a problem that doesn't have an obvious answer: When you have a 50-page transcript, how do you break it into pieces that AI can work with?
The Challenge
- Too small (100-200 words): Fragments lose context, like reading random sentences from a textbook
- Too large (2000+ words): The AI gets overwhelmed with information and loses focus
- Just right: Captures complete thoughts while remaining digestible
The Solution
After testing, the team settled on chunks of approximately 1,500 tokens (roughly 6,000 characters, or a few minutes of spoken lecture). This size:
- Captured complete explanations of concepts
- Fit comfortably within the embedding model's 2,048 token context window
- Provided enough context without overwhelming the retrieval system
An overlap of 300 tokens between chunks ensured that concepts spanning chunk boundaries wouldn't be lost—like having the last sentence of one page repeat at the top of the next.
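A character-based approximation of that scheme is sketched below, treating a token as roughly four characters; the real splitter in ingest.py may count tokens properly, so this is illustrative only.

```python
def chunk_text(text: str, chunk_tokens: int = 1500, overlap_tokens: int = 300,
               chars_per_token: int = 4) -> list[str]:
    """Split text into overlapping windows, approximating tokens as ~4 characters."""
    size = chunk_tokens * chars_per_token                     # ~6,000 characters per chunk
    step = (chunk_tokens - overlap_tokens) * chars_per_token  # stride leaves a 300-token overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

# Example: a 20,000-character transcript yields 4 overlapping chunks of up to
# 6,000 characters, each sharing ~1,200 characters with its neighbour.
```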
The Ollama Headache: When Technology Doesn't Cooperate
Not everything went smoothly. After successfully setting up the new embedding model and cleaning the data, the system... crashed.
The Error
500 Internal Server Error: this model does not support embeddings
The Investigation
The problem? Ollama version 0.9.2 was too old to properly support the new embedding model. What followed was a classic tech troubleshooting session:
- Verify the model was actually downloaded (✓)
- Test with curl commands (✗ - model claimed it couldn't embed)
- Check version numbers (0.9.2 vs required 0.4.0+... wait, that doesn't make sense)
- Realize version numbering had changed
- Upgrade to 0.13.1
- Success!
Lesson learned: Always check compatibility before assuming your code is wrong. Sometimes the tools themselves need updating.
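For reference, the kind of embedding sanity check from step 2 can also be run from Python against Ollama's standard /api/embeddings route; the model tag here is an assumption (check `ollama list` for the exact name on your machine).

```python
import requests

# If the Ollama server is too old for the model, this call is where the
# "500 Internal Server Error: this model does not support embeddings" shows up.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "embeddinggemma", "prompt": "What is conditional expectation?"},
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embedding"]))  # prints the embedding dimensionality on success
```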
The Moment of Truth: Does It Actually Work?
With everything configured, cleaned, and updated, it was time to test whether the system could actually retrieve relevant information.
Test Query
"What is conditional expectation?"
The Results
Result 1 (distance: 0.9666)
Source: Probability lecture
Text: "we will now go through an example which is essentially
a drill to consolidate our understanding of the conditional
expectation and the conditional variance..."
Perfect hit! The system found exactly the right lecture segment. But more importantly, look at the distance scores for results 2 and 3, both around 1.43. That clear separation (0.97 vs 1.43) showed the model could distinguish between highly relevant and somewhat relevant content—crucial for giving students the best possible answers.
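That distance gap is easy to inspect directly from the vector store. Here is a sketch along the lines of the earlier snippets (same assumed model tag, collection name, and database path):

```python
import chromadb
import requests

question = "What is conditional expectation?"
emb = requests.post("http://localhost:11434/api/embeddings",
                    json={"model": "embeddinggemma", "prompt": question},
                    timeout=60).json()["embedding"]

collection = chromadb.PersistentClient(path="axam_db").get_collection("axam")
hits = collection.query(query_embeddings=[emb], n_results=3)

# A big jump between the first and second distances (0.97 vs 1.43 here)
# signals a confident top hit rather than three equally plausible guesses.
for doc, dist in zip(hits["documents"][0], hits["distances"][0]):
    print(f"{dist:.4f}  {doc[:70]}...")
```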
The Speed Problem: When Good Enough Isn't Fast Enough
Early testing revealed a problem: The system worked beautifully but took 4 minutes and 45 seconds (285 seconds) to answer a single question. In a classroom of 40 students, that's... unacceptable.
The Investigation
The bottleneck was identified: retrieving 3 large chunks and using a verbose system prompt consumed too many tokens, leading to slow processing times on CPU-only hardware.
The Solution: A Three-Tier System
Lite Mode (5-10 seconds)
- Smallest model (gemma3:270m)
- 1 chunk retrieval
- Ultra-short prompt
- Best for: Quick facts, definitions
Simple Mode (15-30 seconds)
- Standard model (llama3.2)
- 1 chunk, concise prompt
- Best for: Standard questions
Complex Mode (30-60 seconds)
- Standard model (llama3.2)
- 2 chunks, detailed prompt
- Best for: Deep explanations
The system automatically routes questions to the appropriate mode based on complexity detection keywords.
The Surprise
Testing revealed something unexpected: "Complex" mode was often faster than "Simple" mode (9.8s vs 10.7s). The lesson? Always test your assumptions. Prompt length matters less than you'd think when the bottleneck is generation speed.
Final Verdict: Lite mode was dropped. The 0.3-second speed gain wasn't worth the 33% quality loss. The optimized two-tier system (simple/complex) became the standard.
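A minimal sketch of that keyword-based routing in the final two-tier setup follows; the keyword list and length threshold are invented for illustration and are not AXAM's actual rules.

```python
# Hypothetical complexity keywords; AXAM's real list may differ.
COMPLEX_KEYWORDS = {"explain", "why", "derive", "prove", "compare", "difference", "how"}

def route_query(question: str) -> dict:
    """Choose the generation mode from surface features of the question."""
    words = question.lower().replace("?", "").split()
    if COMPLEX_KEYWORDS & set(words) or len(words) > 15:
        return {"mode": "complex", "model": "llama3.2", "n_chunks": 2}
    return {"mode": "simple", "model": "llama3.2", "n_chunks": 1}

print(route_query("What is a derivative?"))
# -> {'mode': 'simple', 'model': 'llama3.2', 'n_chunks': 1}
print(route_query("Explain why conditional expectation is itself a random variable"))
# -> {'mode': 'complex', 'model': 'llama3.2', 'n_chunks': 2}
```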
Going Global: The Multilingual Challenge
Here's where AXAM's mission truly came to life. East Africa is multilingual: English, French, Swahili, Luganda, Kinyarwanda, and dozens more languages. Students needed to ask questions in their preferred language.
The Technical Challenge
The embedding model (EmbeddingGemma) could understand 100+ languages, but the response-generation model (llama3.2) could only speak about 20-30 well. How do you bridge that gap?
The Solution: Language Detection + Cross-Lingual Retrieval
The system was enhanced to:
- Detect the question's language (using langdetect library)
- Retrieve relevant content even if it's in a different language (EmbeddingGemma's strength)
- Instruct the LLM to respond in the detected language
- Generate answers in the user's preferred language
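A rough sketch of the detection-plus-instruction step, using the langdetect library mentioned above; the language table and prompt wording are illustrative assumptions.

```python
from langdetect import detect  # pip install langdetect

LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French", "de": "German",
                  "pt": "Portuguese", "sw": "Swahili"}

def build_prompt(question: str, context: str) -> str:
    """Detect the question's language and instruct the LLM to answer in it.

    Retrieval itself is unchanged: EmbeddingGemma can match a Swahili question
    to English transcript chunks; only the generation step is told which
    language to respond in.
    """
    lang = LANGUAGE_NAMES.get(detect(question), "English")
    return (f"Answer the question using only the context below. "
            f"Respond in {lang}.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")
```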
Test Results
Five languages tested: English, Spanish, French, German, Portuguese
- Language detection: 100% accurate
- Cross-lingual retrieval: Working perfectly (Spanish question → English content → Spanish answer)
- Answer quality: Excellent across all languages
The Speed Problem (Again)
Initial multilingual tests showed 60-80 second response times—6-8x slower than expected. Why?
The culprit: An overly verbose system prompt (650 tokens). The prompt included:
- XML-style tags
- Repetitive instructions
- Excessive methodology lists
- Multiple reminders about language
The Optimization
The prompt was ruthlessly trimmed:
Before: 650 tokens
After: 90 tokens
Reduction: 86%
Expected improvement: 4-5x faster responses
Key insight: LLMs don't need verbose instructions. They tend to weight the beginning and end of a prompt most heavily, and everything in between is often wasted tokens.
The Cloud Temptation: Why Simpler Is Sometimes Better
At one point, the allure of Google Colab's free T4 GPU seemed irresistible. Imagine: 3-5x faster inference, free access, professional cloud infrastructure.
The Reality Check
Setting up cloud deployment would require:
- Uploading vector databases to Google Drive
- Installing Ollama on Colab (re-downloading 2GB of models per session)
- Setting up tunnels if using PC-based Ollama
- Managing session timeouts (12 hours max)
- Dealing with network latency
The Decision
Stay local. Here's why:
- Simplicity: Everything already works on the local machine
- Reliability: No session timeouts, no re-downloads
- Offline-first: The whole point of AXAM is working without internet
- Cost: Free locally vs. potential cloud costs for production
- Development speed: No upload/download delays
Sometimes the fancier solution isn't the better solution. The local setup was simpler, more reliable, and perfectly adequate for the use case.
Production Ready: Organizing for Deployment
The final phase involved structuring the code for real-world use. The notebook exploration phase had served its purpose—now it was time for production-quality code.
The Structure
Two key files emerged:
ingest.py (Data preparation)
- Load transcripts
- Clean VTT formatting
- Create chunks
- Generate embeddings
- Build vector database
answer.py (Query interface)
- Connect to vector database
- Detect language and complexity
- Retrieve relevant context
- Generate answers
- Support conversation history
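Pulling the earlier sketches together, the ingest.py flow described above looks roughly like this (it reuses the clean_vtt and chunk_text functions sketched earlier; the paths, model tag, and collection name remain assumptions):

```python
import pathlib

import chromadb
import requests

def embed(text: str) -> list[float]:
    """Embed one chunk via the local Ollama server."""
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "embeddinggemma", "prompt": text}, timeout=120)
    r.raise_for_status()
    return r.json()["embedding"]

def ingest(transcript_dir: str = "transcripts", db_path: str = "axam_db") -> None:
    """Load .vtt transcripts, clean, chunk, embed, and store them in ChromaDB."""
    collection = chromadb.PersistentClient(path=db_path).get_or_create_collection("axam")
    for path in pathlib.Path(transcript_dir).glob("*.vtt"):
        text = clean_vtt(path.read_text(encoding="utf-8"))  # from the VTT sketch above
        for i, chunk in enumerate(chunk_text(text)):        # from the chunking sketch above
            collection.add(ids=[f"{path.stem}-{i}"],
                           documents=[chunk],
                           embeddings=[embed(chunk)],
                           metadatas=[{"source": path.stem}])
```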
The History Feature
One crucial addition: conversation memory. Students don't ask isolated questions—they have dialogs. The system needed to remember context:
Student: "What is probability?"
AXAM: [explains probability]
Student: "How does it relate to expectation?"
AXAM: [uses previous context to explain the connection]
Student: "Can you give an example?"
AXAM: [provides example building on previous discussion]
This context awareness transforms AXAM from a search engine into a teaching assistant.
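A simple way to implement that memory is to keep a short window of recent turns and prepend them to each prompt. The window size and formatting below are guesses at the idea, not answer.py's actual code.

```python
from collections import deque

class Conversation:
    """Keep the last few question/answer pairs and fold them into each prompt."""

    def __init__(self, max_turns: int = 3):
        self.turns = deque(maxlen=max_turns)  # only the most recent exchanges survive

    def prompt(self, question: str, context: str) -> str:
        history = "\n".join(f"Student: {q}\nAXAM: {a}" for q, a in self.turns)
        return (f"{history}\n\nContext:\n{context}\n\n"
                f"Student: {question}\nAXAM:")

    def record(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
```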
Lessons Learned: What Really Matters
After weeks of development, testing, and optimization, several key insights emerged:
1. Quality Compounds at Scale
A 9-point MTEB improvement might seem small in isolation, but across 117,000 chunks and thousands of student queries, it translates to 23% more students getting correct answers. Small improvements in foundational components create massive downstream effects.
2. Data Cleaning Is Unglamorous But Crucial
Removing VTT formatting reduced file sizes by 78% and dramatically improved retrieval quality. The time spent on data preparation pays dividends every single query.
3. Prompt Engineering > Prompt Length
A 650-token verbose prompt performed no better than a 90-token focused one—but processed 4-5x slower. Clarity beats verbosity every time.
4. Test Your Assumptions
"Complex" mode being faster than "Simple" mode defied expectations. Always measure; don't assume.
5. Simpler Deployments Win
The cloud solution looked attractive but added complexity without clear benefits. For offline-first applications, local deployment makes sense.
6. Multilingual Isn't Optional
For global education tools, language support isn't a feature—it's a requirement. Students learn best in their native language.
7. Speed Matters More Than You Think
The difference between 60-second and 15-second responses isn't just convenience—it's the difference between 40 students getting help in a class period or only 10 getting help.
The Impact: What This Means for Students
Let's return to that student in rural Uganda. With AXAM running on a basic laptop or tablet:
- No internet required: The entire MIT OpenCourseWare library lives locally
- Ask in any language: Question in Swahili, Luganda, or English—all work
- Instant expertise: 15-second response times make learning interactive, not frustrating
- Quality answers: 23% improvement in retrieval accuracy means better learning outcomes
- Conversation memory: Build on previous questions like chatting with a tutor
The technical decisions—embedding models, chunk sizes, prompt optimization—all serve this singular goal: making world-class education accessible where it's needed most.
Looking Forward: What's Next for AXAM
The foundation is solid, but the journey continues:
Short-term priorities:
- Index the full 7,798-document corpus (currently tested with 10)
- Benchmark performance across different question types
- User testing with actual students in East Africa
- Fine-tune based on real-world feedback
Medium-term goals:
- Expand beyond MIT OCW to other open educational resources
- Add support for more local languages (Kinyarwanda, Luganda, etc.)
- Develop offline-friendly visualizations and diagrams
- Create teacher-facing dashboards for tracking student progress
Long-term vision:
- Partner with schools across East Africa for pilot programs
- Explore solar-powered hardware for maximum off-grid capability
- Build community contribution features where local teachers can add content
- Scale to other regions with similar connectivity challenges
Takeaways for Builders
If you're working on AI systems, particularly for education or resource-constrained environments, here's what matters:
Technical Decisions
- Choose embedding models based on deployment constraints, not just benchmarks
- Clean your data ruthlessly—garbage in, garbage out is real
- Optimize for your bottleneck—in CPU-only environments, that's generation speed
- Test multilingual support early if it's in your roadmap
- Measure everything—your intuitions about performance are probably wrong
Product Philosophy
- Simpler is usually better—fight the urge to over-engineer
- Design for your actual users, not the ideal user in your head
- Speed is a feature—optimize relentlessly
- Offline-first matters for billions of people globally
- Quality compounds—invest in foundational improvements
Development Process
- Prototype in notebooks, deploy in scripts
- Document your decisions—you'll forget why you made them
- Test assumptions constantly—what seems obvious often isn't
- Iterate based on real metrics, not feelings
- Sometimes good enough is good enough—ship and improve
The Bigger Picture
AXAM represents something larger than a technical project. It's a reminder that the best AI applications aren't always the most sophisticated—they're the ones that solve real problems for real people.
While tech companies race to build ever-larger models requiring ever-more computing power, there's a parallel challenge: How do we make powerful AI accessible to the billions of people without reliable internet, expensive hardware, or unlimited data plans?
The answer isn't one breakthrough—it's a thousand small optimizations. It's choosing a 227MB model instead of a 2.2GB one. It's cleaning VTT formatting to save 78% storage. It's trimming prompts from 650 to 90 tokens. It's choosing local deployment over cloud when it makes sense.
Each decision, each optimization, each test brings world-class education closer to students who need it most.
And that's worth far more than any benchmark score.
Closing Thoughts
Building AXAM has been a masterclass in practical AI engineering. Not the kind taught in research papers or online courses, but the kind learned through late nights debugging Ollama versions, discovering VTT formatting issues, and realizing that your "obviously faster" optimization is actually slower.
For Emmanuel and the AXAM project, the technical journey is just beginning. But the foundation is solid: a multilingual, offline-capable, optimized RAG system ready to bring MIT-level education to students across East Africa.
The next chapter? Putting it in students' hands and seeing what they can learn.
Because the measure of an AI system isn't how impressive it looks in a demo—it's how many lives it changes when the internet goes down.
About the Project: AXAM is being developed as part of the NSF I-Corps program, with deployment planned for schools in Uganda and Rwanda. The project combines MIT OpenCourseWare content with locally-optimized AI models to create an offline educational assistant.
Technical Stack: EmbeddingGemma-300M (embeddings), llama3.2 (generation), ChromaDB (vector store), Ollama (model serving), LangDetect (language detection), running on consumer-grade hardware optimized for offline operation.