Post

Building a Local AI Chatbot with Ollama and Python

Learn how to build a private, local AI chatbot in Python using Ollama. Covers installation, streaming responses, conversation memory, and a simple Gradio web UI.

Building a Local AI Chatbot with Ollama and Python

Why Run a Chatbot Locally

No API key, no per-token cost, no data leaving your machine. Ollama packages quantized open models behind a simple local API — you get most of the convenience of a hosted LLM API without sending a single request over the internet.

Installing Ollama and Pulling a Model

1
2
3
4
5
6
7
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (4-bit quantized, ~4.7GB)
ollama pull llama3.1:8b

pip install ollama gradio

Your First Local Completion

1
2
3
4
5
6
7
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain recursion in two sentences."}],
)
print(response["message"]["content"])
1
2
Recursion is when a function calls itself to solve smaller instances of the same problem.
It needs a base case to stop, or it will call itself forever and crash with a stack overflow.

No API key, no network call beyond localhost:11434 — the model is running entirely on your machine.

Streaming Responses

For a chatbot UI, you want tokens to appear as they’re generated, not all at once after a long wait:

1
2
3
4
5
6
7
8
9
10
def stream_chat(prompt: str, history: list[dict] = None):
    messages = (history or []) + [{"role": "user", "content": prompt}]
    full_response = ""
    for chunk in ollama.chat(model="llama3.1:8b", messages=messages, stream=True):
        token = chunk["message"]["content"]
        full_response += token
        print(token, end="", flush=True)
    return full_response

stream_chat("Write a haiku about Python programming.")
1
2
3
Indented blocks
Snake-like syntax flows with ease
Bugs hide in whitespace

Adding Conversation Memory

Without history, every message is treated in isolation. Track the conversation as a growing list of messages:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
class LocalChatbot:
    def __init__(self, model: str = "llama3.1:8b", system_prompt: str = "You are a helpful assistant."):
        self.model = model
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        response = ollama.chat(model=self.model, messages=self.history)
        reply = response["message"]["content"]
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def reset(self):
        self.history = self.history[:1]  # keep only the system prompt

bot = LocalChatbot()
print(bot.send("My name is Alex."))
print(bot.send("What's my name?"))
1
2
Nice to meet you, Alex! How can I help you today?
Your name is Alex.

The second response only works because self.history carries the first exchange forward — exactly the same pattern as conversation memory in hosted-API chatbots.

Building a Simple Web UI with Gradio

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import gradio as gr

bot = LocalChatbot()

def respond(message, chat_history):
    reply = bot.send(message)
    chat_history.append((message, reply))
    return "", chat_history

with gr.Blocks() as demo:
    chatbot_ui = gr.Chatbot()
    msg = gr.Textbox(placeholder="Ask me anything...")
    msg.submit(respond, [msg, chatbot_ui], [msg, chatbot_ui])

demo.launch()

This launches a local web chat interface at http://127.0.0.1:7860 — no frontend framework required, just Python.

Limiting Context to Control Memory and Speed

Long conversations slow down every subsequent call because the full history gets reprocessed each time. Cap it:

1
2
3
4
5
6
7
8
def send_with_window(self, message: str, max_turns: int = 10) -> str:
    self.history.append({"role": "user", "content": message})
    system = self.history[0]
    recent = self.history[1:][-(max_turns * 2):]  # keep last N user/assistant pairs
    response = ollama.chat(model=self.model, messages=[system] + recent)
    reply = response["message"]["content"]
    self.history.append({"role": "assistant", "content": reply})
    return reply

Trimming to a sliding window keeps latency roughly constant instead of growing with every message in a long-running conversation.

Key Takeaways

  • Ollama runs quantized open models locally with a simple chat API — no API key or network dependency
  • Streaming responses (stream=True) makes a chatbot feel responsive instead of making users wait for the full reply
  • Conversation memory is just a growing list of {role, content} messages passed back on every call
  • Gradio gives you a working chat web UI in under 20 lines of Python
  • Trim conversation history to a sliding window to keep latency from growing indefinitely
  • Local models trade some quality for privacy and zero per-token cost — benchmark against a hosted API for your specific task
Khushal Jethava
Khushal Jethava

Machine Learning Engineer at Codiste, specializing in Generative AI, NLP, and Computer Vision. Building production AI systems with Python.

This post is licensed under CC BY 4.0 by the author.