Building an Offline AI Teaching Assistant: Part 1

 

How one graduate student turned 7,600 educational videos into an intelligent, offline learning companion for resource-constrained schools


The Dream That Started With a Question

Emmanuel sat in his data analytics class at Yeshiva University's Katz School, watching his professor explain neural networks. As President of the Katz African Students Association, he couldn't help but think about students back home in Uganda and Rwanda—brilliant minds with limited access to quality educational resources.

"What if," he wondered, "we could package MIT's entire course library into something that works without internet, runs on basic computers, and answers student questions like a patient teaching assistant?"

That question launched a month-long technical odyssey that would teach him more about AI, education, and real-world constraints than any textbook ever could.


The Raw Material: 7,600 of MIT's Best Lectures

Emmanuel's starting point was remarkable: complete transcripts from 7,600 MIT OpenCourseWare videos. Topics ranged from probability theory and differential equations to computer science fundamentals—exactly what undergraduate students in East Africa needed but rarely accessed.

The challenge? These weren't neatly organized textbooks. They were conversational lecture transcripts, sometimes messy, often context-dependent, totaling roughly 380 million characters of educational content.

His goal seemed simple: build a system where a student could ask "What is conditional expectation?" and receive a clear, accurate answer drawn from these lectures—all without touching the internet.

Simple in concept. Brutally complex in execution.


First Decision: The Technology Stack

Emmanuel faced an immediate fork in the road. The AI industry runs on two tracks:

The Cloud Path: Use services like OpenAI's GPT-4, pay per query, require constant internet connectivity. Fast, powerful, expensive, and completely unsuitable for schools with unreliable electricity and no broadband.

The Local Path: Download AI models, run everything on local computers, no internet required after setup. Slower, more complex, but perfectly aligned with his mission.

The choice was obvious. Emmanuel committed to building a completely offline system.


The Architecture: RAG Explained Simply

Emmanuel discovered that modern AI question-answering systems use something called RAG—Retrieval-Augmented Generation. Think of it like an open-book exam for AI:

Step 1: The Retrieval
When you ask a question, the system searches through all available materials to find the most relevant sections—like a librarian pulling the right books off the shelf.

Step 2: The Generation
An AI model reads those relevant sections and generates a natural language answer—like a tutor who's just reviewed the material explaining it to you.

This two-step dance would become the heart of his system. But first, he needed to solve a fundamental problem: computers don't naturally understand text the way humans do.


The Embedding Challenge: Teaching Computers to Understand Meaning

Here's where things got interesting. Computers need to convert text into numbers to work with it. But not just any numbers—they need numbers that capture meaning.

Consider these two sentences:

  • "The differential equation describes motion over time"
  • "This formula models how things change temporally"

To a simple word-counting program, these are completely different. To a human, they're saying nearly the same thing. How do you teach a computer to recognize that similarity?

Enter embedding models—AI systems trained to convert text into coordinates in a vast mathematical space. Similar meanings cluster together, like cities on a map. "Machine learning" and "artificial intelligence" end up close to each other. "Banana" and "quantum mechanics" don't.

Emmanuel tested several options:

all-MiniLM-L6-v2: A compact model (80MB) that converts text into 384-dimensional coordinates. Fast, efficient, good enough for educational content.

mxbai-embed-large: A more sophisticated model (670MB) producing 1024-dimensional coordinates. Better quality, but seven times larger.

OpenAI's text-embedding-3-large: State-of-the-art quality (3072 dimensions), but requires internet and costs money per query.

The decision came down to deployment constraints. Schools in Rwanda wouldn't have gigabytes to spare on embedding models. He chose all-MiniLM-L6-v2—small enough to fit on a USB drive, good enough for the job.
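
As a rough illustration of what those coordinates buy you, here is a minimal sketch (not Emmanuel's actual code) that embeds the two sentences above with the sentence-transformers library and compares them; the article names the model, but the loading and comparison code here is an assumption:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2: ~80 MB, maps any sentence to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("The differential equation describes motion over time")
b = model.encode("This formula models how things change temporally")
c = model.encode("Bananas are a good source of potassium")

# Cosine similarity: close to 1.0 for similar meanings, near 0 for unrelated text.
print(util.cos_sim(a, b).item())  # high, despite sharing almost no keywords
print(util.cos_sim(a, c).item())  # low
```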


The Chunking Dilemma: How to Slice 380 Million Characters

You can't feed an entire MIT course into an AI model at once. Models have limits—context windows measured in thousands of tokens, not millions. Emmanuel needed to divide his massive text corpus into digestible pieces.

But how? Random cuts would split sentences mid-thought. Dividing by paragraphs created wildly inconsistent sizes. The solution was token-aware chunking—cutting text at precise 1000-token boundaries with 200-token overlaps to preserve context.

Why 1000 tokens? It's a sweet spot:

  • Small enough to be specific and focused
  • Large enough to contain complete concepts
  • Overlaps ensure no idea gets cut in half

This turned his 7,600 videos into roughly 95,000 chunks—95,000 pieces of knowledge, each precisely sized and ready to be searched.
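
A minimal sketch of this kind of token-aware chunking, assuming the tiktoken library (which the article names later; the specific encoding, cl100k_base, is an assumption):

```python
import tiktoken

def chunk_transcript(text: str, chunk_tokens: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into ~1000-token windows with 200-token overlap so no idea is cut in half."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    step = chunk_tokens - overlap               # advance 800 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```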


Building the Vector Database: A Library of Coordinates

With chunks created and an embedding model chosen, Emmanuel faced the next challenge: storage and search.

He needed a database that could:

  • Store 95,000 text chunks
  • Store their 384-dimensional coordinate representations
  • Find the most relevant chunks for any question in milliseconds
  • Work completely offline
  • Fit on modest hardware

ChromaDB emerged as the perfect fit. Unlike traditional databases that match exact keywords, vector databases find semantic similarity. Ask about "probability distributions" and it retrieves chunks discussing "random variable behavior"—even if the exact phrase differs.

The alternatives were tempting but impractical:

FAISS: Faster for millions of vectors, but required manual metadata management and more complex setup.

Milvus: Production-grade for massive scale, but overkill for 95,000 vectors and required server infrastructure.

ChromaDB offered the best balance: file-based (just copy a folder), fast enough (0.15 seconds to search 95,000 vectors), and simple to deploy.
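
A sketch of how the chunks might be indexed and searched with ChromaDB (collection name, metadata fields, and the toy documents are illustrative, not taken from the article). Conveniently, ChromaDB's default embedding function is the same all-MiniLM-L6-v2 model chosen earlier:

```python
import chromadb

# File-based persistence: the whole database is a folder you can copy onto a USB drive.
client = chromadb.PersistentClient(path="./mit_ocw_db")
collection = client.get_or_create_collection(name="mit_lectures")

# Toy stand-ins for the ~95,000 real chunks produced by the chunking step.
chunks = [
    "The conditional expectation E(X|Y) is the expected value of X given what we know about Y...",
    "A Fourier series represents a periodic function as a sum of sines and cosines...",
]

collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    metadatas=[{"course": "probability"}, {"course": "signals"}],  # illustrative metadata
)

# Semantic search: matches meaning, not exact keywords.
results = collection.query(query_texts=["What is conditional expectation?"], n_results=1)
print(results["documents"][0][0])
```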


The First Major Setback: 285 Seconds of Horror

Emmanuel wrote his first complete RAG pipeline. Retrieved three relevant chunks, built a comprehensive system prompt, sent everything to his chosen AI model (llama3.2), and waited.

And waited.

And waited.

4 minutes and 45 seconds later, the answer appeared.

His heart sank. In Rwanda, students would give up after 30 seconds. Teachers would assume the system crashed. This was completely unusable.

The metrics revealed the problem:

  • Retrieval: 0.02 seconds (perfect!)
  • Generation: 285 seconds (catastrophic)
  • Input tokens: 3,074 (massive context)
  • Processing speed: 1.5 tokens per second (glacial)
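
Reconstructed roughly, that first unoptimized pipeline might have looked like the sketch below, assuming the Ollama Python client and the ChromaDB collection from the earlier sketch (the prompt wording is illustrative):

```python
import chromadb
import ollama

client = chromadb.PersistentClient(path="./mit_ocw_db")
collection = client.get_or_create_collection(name="mit_lectures")

SYSTEM_PROMPT = "You are a patient MIT teaching assistant..."  # in reality ~1,500 tokens of instructions

def heavy_rag(question: str) -> str:
    # Retrieval: three chunks of context, 2,000+ tokens of material.
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(hits["documents"][0])

    # Generation: the model must read system prompt + context + question (~3,074 tokens)
    # before producing a single word of the answer. On CPU, this is where the 285 seconds went.
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["message"]["content"]
```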

Understanding the Bottleneck: The CPU Reality Check

Here's what Emmanuel learned the hard way: Large language models are computationally expensive. When tech companies show you instant AI responses, there are usually massive GPU clusters behind the scenes.

His laptop had a decent CPU and 30GB of RAM—more than most school computers would have. Yet even this struggled. Why?

The Math of Language Models
Processing text through a transformer involves billions of arithmetic operations, and the self-attention step scales quadratically with input length. Double your input, roughly quadruple the processing time.

His 3,074 input tokens weren't just big—they were choking his CPU-only setup.

The hard truth: If it took 285 seconds on his relatively powerful machine, school computers with 4-8GB RAM and older processors might take 10-15 minutes per answer. That's not a learning tool; that's a patience test.


The Optimization Journey: From 285 Seconds to 19 Seconds

Emmanuel refused to give up. He systematically tested every possible optimization:

Attempt 1: Smaller Models

Switching from llama3.2 (2GB) to gemma3:270m (291MB) helped marginally but introduced quality concerns. The smallest model sometimes gave oversimplified or incorrect answers.

Attempt 2: Aggressive Quantization

AI models store their weights and run their calculations as numbers with a chosen precision. Higher precision (16-bit, 32-bit floats) preserves quality but processes slowly; lower precision (4-bit, 2-bit quantization) sacrifices a little accuracy for major gains in speed and memory.

His models were already 4-bit quantized. Going to 2-bit helped modestly but had diminishing returns—the bottleneck was elsewhere.

Attempt 3: The Breakthrough - Lite RAG

Emmanuel realized he was asking his system to do too much. Three chunks of context? That's 2,000+ tokens of material for the model to process just to answer one question. A long, detailed system prompt explaining the assistant's role? Another 1,500 tokens.

What if he stripped everything to essentials?

Lite RAG was born:

  1. Retrieve just ONE chunk instead of three (3x reduction)
  2. Minimal system prompt (50 tokens vs 1,500)
  3. Limit answer length to 150 tokens
  4. Total input: ~400 tokens instead of 3,074

The result? 18.9 seconds per answer.

Still not instant, but absolutely usable. Students could ask a question, wait about 20 seconds (less than asking a teacher), and get a quality answer.
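
In code, the Lite RAG changes are small but decisive. A sketch, again assuming the Ollama client and reusing the ChromaDB collection opened in the earlier sketch (the prompt wording is illustrative; num_predict is the standard Ollama option for capping output length):

```python
import ollama

def lite_rag(question: str) -> str:
    # Retrieval: ONE chunk instead of three.
    hit = collection.query(query_texts=[question], n_results=1)
    context = hit["documents"][0][0]

    # A ~50-token prompt instead of a 1,500-token system prompt.
    prompt = (
        "You are a patient teaching assistant. Using the lecture excerpt below, "
        f"answer the question clearly.\n\nExcerpt:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # Cap the answer at 150 tokens; total input is now ~400 tokens instead of 3,074.
    response = ollama.generate(model="llama3.2", prompt=prompt, options={"num_predict": 150})
    return response["response"]
```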


The Quality Test: Does Lite RAG Still Teach?

Speed means nothing if accuracy suffers. Emmanuel tested his optimized system:

Question: "What is conditional expectation?"

Answer: "The conditional expectation of a random variable X given a random variable Y is the expected value of X when considering the knowledge of Y. It's denoted as E(X|Y) and represents the 'best guess' for X based on what we know about Y. In essence, it's the weighted average of all possible values of X, where the weights are determined by the probability distribution of X given Y."

Clear. Accurate. Pedagogically sound. Perfect for an undergraduate encountering the concept.

The system worked.


The Hybrid Solution: Instant + Smart

But Emmanuel pushed further. Why make students wait 19 seconds for common questions asked by thousands of students?

He designed a two-tier system:

Tier 1: Pre-Computed Answers (0.01 seconds)
Run the system overnight on common questions:

  • "What is a Fourier series?"
  • "How do differential equations work?"
  • "Explain conditional probability"

Store these answers in a simple database. When a student asks one of these questions, return the pre-computed answer instantly.

Tier 2: Live RAG (18.9 seconds)
For novel questions or unusual phrasings, fall back to the full RAG pipeline.

This hybrid approach meant 80-90% of queries would feel instant, while still handling the long tail of unique questions.
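
A sketch of the two-tier lookup, assuming a small SQLite table for the pre-computed answers (the article doesn't name the storage format) and reusing lite_rag() from the previous sketch. A real system would also normalize or fuzzily match question phrasings; exact-string matching is the simplest possible Tier 1:

```python
import sqlite3

db = sqlite3.connect("precomputed_qa.db")
db.execute("CREATE TABLE IF NOT EXISTS qa (question TEXT PRIMARY KEY, answer TEXT)")

def answer(question: str) -> str:
    key = question.strip().lower()

    # Tier 1: pre-computed answer, ~0.01 seconds.
    row = db.execute("SELECT answer FROM qa WHERE question = ?", (key,)).fetchone()
    if row:
        return row[0]

    # Tier 2: live Lite RAG for novel questions, ~19 seconds.
    result = lite_rag(question)
    db.execute("INSERT OR REPLACE INTO qa VALUES (?, ?)", (key, result))  # cache for next time
    db.commit()
    return result
```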


Deployment Architecture: From USB to Classroom

Emmanuel's final system fit on a 16GB USB drive:

  • Vector database: 1GB (all 95,000 chunks + embeddings)
  • Embedding model: 46MB (all-minilm)
  • Language model: 2GB (llama3.2)
  • Pre-computed Q&A: 500MB
  • Supporting code: 50MB

Total: ~3.5GB with room to spare.

A teacher could copy this to school computers. Students would access it through a simple web interface—no installation, no configuration, just open a browser and start learning.


The Technical Decisions That Mattered

Looking back, several choices proved critical:

1. No LangChain Dependency

The popular LangChain library offers convenient abstractions, but Emmanuel found it added 150MB of dependencies and subtle performance overhead. Building directly on ChromaDB and Ollama kept the system lean.

2. Token-Aware Chunking

Using tiktoken to split at exact token boundaries (rather than character or sentence boundaries) ensured clean context windows and predictable behavior.

3. Metrics-Driven Optimization

Emmanuel built comprehensive timing and token-counting into his system from day one. You can't optimize what you don't measure. Seeing "285 seconds" in hard numbers forced him to confront reality.
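
Ollama reports token counts with every response, so the instrumentation can be a thin wrapper. A sketch (the field names are those Ollama returns; the wrapper itself is illustrative, not Emmanuel's code):

```python
import time
import ollama

def timed_generate(prompt: str, model: str = "llama3.2") -> dict:
    """Run a generation and return the answer alongside the numbers that expose bottlenecks."""
    start = time.perf_counter()
    response = ollama.generate(model=model, prompt=prompt)
    elapsed = time.perf_counter() - start

    input_tokens = response["prompt_eval_count"]   # how much the model had to read
    output_tokens = response["eval_count"]         # how much it wrote
    return {
        "answer": response["response"],
        "seconds": round(elapsed, 1),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tokens_per_second": round(output_tokens / elapsed, 2),
    }
```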

4. Lite RAG Over Fancy Variants

He researched Graph RAG, Agentic RAG, Corrective RAG—all interesting on paper. None solved his core problem: CPU constraints. Sometimes the simplest solution (retrieve less, generate less) beats sophisticated architectures.


Lessons From the Field

Lesson 1: Context is a Luxury

Western AI applications assume unlimited computing power. They retrieve five, ten, even twenty document chunks per query. In resource-constrained environments, one carefully selected chunk often suffices.

Lesson 2: Offline Demands Different Design

Cloud-based AI can hide latency behind loading animations and use massive parallel processing. Offline AI must be honest about constraints and creative with solutions (pre-computation, streaming responses, efficient models).

Lesson 3: Quality Doesn't Always Need Quantity

GPT-4 is brilliant. It's also huge, expensive, and online-only. Llama3.2 is smaller, free, offline, and for educational Q&A? Nearly as good.

Lesson 4: The Last Mile is Hardest

Going from "works on my laptop" to "works on a 5-year-old school computer in rural Rwanda" required more optimization than building the initial system. Deployment constraints matter more than theoretical elegance.


The Numbers That Tell the Story

Final System Performance:

  • Total documents processed: 7,600 videos
  • Text corpus size: 380 million characters
  • Chunks created: 95,000
  • Vector database size: 1 GB
  • Embedding dimensions: 384
  • Average retrieval time: 0.15 seconds
  • Average generation time (Lite RAG): 18.9 seconds
  • Pre-computed answer lookup: 0.01 seconds
  • Total deployment size: 3.5 GB
  • Minimum RAM requirement: 4 GB
  • Internet required: 0 bytes (fully offline)

What This Means for Education

Emmanuel's system represents something bigger than technical achievement. It's a proof of concept: high-quality AI-powered education can work offline, on modest hardware, at near-zero marginal cost.

Consider the implications:

For Students in Connected Regions:
Instant access to world-class teaching materials without internet costs or platform subscriptions.

For Students in Rural Areas:
Finally, educational equity. A student in rural Rwanda can ask the same questions and get the same quality answers as a student at MIT.

For Teachers:
An intelligent assistant that never tires, never runs out of patience, and can explain concepts twenty different ways until understanding clicks.

For Educational Systems:
Scalable quality. One good implementation can serve thousands of schools at minimal cost.


The Road Ahead

Emmanuel's AXAM platform (his NSF I-Corps project) continues to evolve. Current development includes:

  • Voice interface for students who prefer speaking to typing
  • Multilingual support (Kinyarwanda, Swahili, French)
  • Progress tracking to identify struggling students
  • Adaptive difficulty that adjusts explanations to student level
  • Offline sync for updates when internet is briefly available

But the core innovation remains: bringing AI-powered education to the students who need it most, without requiring infrastructure they don't have.


Reflections: What I Learned Building This

In Emmanuel's words:

"I started this project thinking the hard part would be the AI. Turns out, the hard part was accepting constraints.

In grad school, professors say 'assume infinite computing resources.' In Rwanda, I needed to assume a decade-old laptop with 4GB of RAM and no internet.

That constraint didn't limit the solution—it clarified it. It forced me to ask: What truly matters? Fast retrieval or perfect retrieval? Comprehensive answers or sufficient answers? Sophisticated architecture or reliable architecture?

Every optimization taught me something:

  • Going from 285 seconds to 19 seconds taught me ruthless prioritization
  • Choosing ChromaDB over Milvus taught me appropriate scaling
  • Picking llama3.2 over GPT-4 taught me 'good enough' beats 'perfect but unavailable'

The biggest lesson? Technology should serve people's real needs, not our imagined ideal scenarios. A 19-second answer that works in a Rwandan classroom beats a 1-second answer that requires fiber optic internet.

This isn't just about AI or education. It's about designing technology for the world as it is, not as we wish it to be."


Try It Yourself: The Open Source Component

While Emmanuel's full AXAM platform remains proprietary, he's committed to sharing the core RAG architecture. The principles here—offline embeddings, vector databases, Lite RAG optimization—work for any educational content.

Imagine building similar systems for:

  • Medical training in rural clinics
  • Legal aid in underserved communities
  • Agricultural knowledge for smallholder farmers
  • Technical training for trades and vocations

The template is proven. The technology is accessible. The need is universal.


Final Thoughts: Education as Leverage

There's a quote Emmanuel keeps on his desk, from educator Sal Khan: "Give every person on Earth access to a world-class education."

For years, that sounded aspirational. Now, it sounds achievable.

Not through massive infrastructure investments or satellite internet or one-laptop-per-child programs—though those help.

Through appropriate technology. Through understanding constraints. Through building systems that work for people where they are, not where we imagine them to be.

Emmanuel's journey from "7,600 videos" to "19-second answers on a USB drive" shows a path forward. Not the only path, certainly not the easiest path, but a real path.

And for a student in rural Rwanda asking "What is conditional expectation?" and getting a clear, patient, accurate answer nineteen seconds later?

That path changes everything.


Emmanuel continues his work on AXAM as an NSF I-Corps Fellow while completing his M.S. in Data Analytics at Yeshiva University. The platform is currently in pilot testing at select schools in Uganda and Rwanda, with broader deployment planned for 2026.


Key Takeaways:

  1. Offline AI is possible with careful model selection and optimization
  2. Constraints clarify design—resource limits forced better architecture
  3. Lite RAG works—one chunk, minimal prompt, limited output = 15x faster
  4. Pre-computation scales—answer common questions once, serve infinitely
  5. Deployment matters more than demos—what works on your laptop isn't the end goal
  6. Educational equity through technology—AI can democratize access to quality teaching

Further Reading:

  • MIT OpenCourseWare: ocw.mit.edu
  • ChromaDB Documentation: docs.trychroma.com
  • Ollama Local Models: ollama.ai
  • NSF I-Corps Program: nsf.gov/icorps

Have you built offline AI systems? Deployed education technology in resource-constrained settings? Share your experiences and lessons learned in the comments below.
