In 2024, Large Language Models (LLMs) and Generative AI tools like ChatGPT or Copilot are on everyone’s mind in the tech and business world. Many are wondering about the future capabilities and use-cases of AI for themselves, their business, or humanity as a whole. But as is often the case with new technologies, the best way to find out what you can do with them is to get your hands on them, play with them, and try them out for yourself. That’s true not just for engineers, but for businesses as well.

The black box nature of AI models and the apparent complexity associated with their use seem to have deterred many from more serious experimental ventures.

The good news is that, as of 2024, the AI ecosystem has evolved to a point where you don’t have to be an AI engineer or data scientist to deploy, test and play with Generative AI. Thanks in large part to open source projects, we now have standardized workflows, powerful libraries and familiar-looking APIs to work with. As you can see in the example at the end of this blog post, it’s now fairly easy to take your first steps building an AI app.

The goal of this blog post is to give you all the information you need to create your first simple AI application. A few useful resources and important concepts are introduced at the beginning. At the end of the post there is an example implementation of an AI tool that combines unstructured customer feedback into a structured list of tasks.

The example code can be found on GitHub.

Self-Hosted Large Language Models

For a few months now, lighter models that can be run on your own hardware have been available alongside commercial models such as GPT-4, Claude or Gemini. All concepts and examples in this text apply to both types of models. In particular, LangChain allows commercial and local models to be used interchangeably. In this article, however, the focus is on locally executable models, as they have improved enormously in the recent past and are now comparably powerful in many respects. In addition, using local models ensures that potentially sensitive data does not leave the local network.

What resources and tools should I know?

Huggingface: The Open AI Community

Huggingface is a thriving community that provides an informative overview and structured collection of machine learning models, datasets, libraries and information.

Here you will find an overview of all the machine learning models out there, categorized by their use cases together with a lot of information and metadata. You will also find instructions on how to deploy these models locally or using one of the big cloud providers like Google Cloud, Microsoft Azure or AWS.

Within this vast amount of information, there are two resources that are particularly interesting for beginners: the Inference API and the transformers library.

The term inference just means that we take a trained model, feed our own data into it and let it infer a resulting answer. The Inference API lets you run inference on a great number of models hosted on Huggingface servers without having to set up anything yourself, using plain HTTP.
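As an illustration, here is a minimal sketch of calling the Inference API with plain HTTP from Python. The model name, endpoint URL and response format are assumptions based on the Huggingface documentation at the time of writing, so check the current docs and use your own access token:

import requests

# Hypothetical example: query a summarization model hosted on Huggingface.
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": "Bearer <your-huggingface-token>"}

payload = {"inputs": "The wifi in room 7 is not working. Please send a technician as soon as possible."}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())  # e.g. a list with a 'summary_text' entry for summarization models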

The transformers library provides a unified API for working with pretrained models on top of machine learning frameworks like PyTorch, TensorFlow, and JAX. It provides many tools for tasks like Natural Language Processing (NLP) as well as for training, fine-tuning and running inference on existing models.
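As a small example of what working with transformers feels like, the pipeline helper wraps model download, tokenization and inference behind a single call (the first run downloads the model weights; without an explicit model name, the library simply picks a default for the task):

from transformers import pipeline

# Create a sentiment-analysis pipeline; transformers selects a default model for the task.
classifier = pipeline("sentiment-analysis")

result = classifier("The wifi in my room finally works again, thank you!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]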

Ollama: A Docker-Like LLM-Runtime

Ollama is a tool that makes it easy for engineers to manage and run a number of popular publicly available LLMs such as Llama3 or Mistral. The command-line tool provides a Docker-like API that makes it easy to download (pull), manage, and run different versions of a model. It also provides tools for monitoring and logging. It’s a good choice for experimenting with prompts and LLMs locally, or for running one or more models on a production machine.

Langchain: A Toolkit To Build Generative AI Apps

LangChain is a good choice for developers looking for a comprehensive framework to quickly develop AI applications or services in Python or TypeScript. The project consists of a set of open source libraries that provide many simple building blocks for AI pipelines, as well as third-party integrations, e.g. to integrate vector databases for context information or to connect to an Ollama instance and use a running model.

How Do I Run My Local LLM?

Many tools can be used to run and infer a local model. However, Ollama’s simple API makes it easy to get started and test prompt formats and models via command line. It also runs as an HTTP server on port 11434 and is therefore a good choice to run a monitored production instance on a dedicated machine that can be accessed by other applications and services.
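To give a first impression, here is a minimal sketch of talking to a locally running Ollama instance over its HTTP API, assuming the mistral model has already been pulled (see the Ollama API documentation for the exact request and response fields):

import requests

# Ask the local Ollama server (default port 11434) to generate a completion.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Summarize: the wifi in room 7 is broken.", "stream": False},
)
print(response.json()["response"])  # the generated text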

How Do I Provide Information To My AI System?

Most LLMs are general purpose language models, which means they have been trained to handle one or several types of language, like natural language or programming languages. Training is also one way to hand factual information to a model, although - at least for now - it is not a practical approach for most businesses because it’s a very expensive and complex process.

Passing Information In The Prompt: The Context Window

The easiest way to provide information to a model is to pass it directly to the model as part of the input prompt. This way it is possible to provide additional and contextual information to the model along with the actual prompt. In the example later in this text, we let the LLM summarize some feedback mails, so that these mails are passed to the LLM as context along with the actual prompt. The maximum input length that can be passed to a model to generate a response is called the context window, and it is measured in the number of tokens - where a token is a sort of building block of a word or phrase.

Not surprisingly, one of the most important model characteristics today is the so-called context window length, since larger context windows allow models to consider more information for their tasks.

Despite recent advances in increasing context windows, they are usually still too small to hold a large number of documents or a large code base. To overcome this, techniques like RAG (Retrieval-Augmented Generation) have been developed that allow models to dynamically access the information they need.

What is RAG (Retrieval Augmented Generation)?

The concept is straightforward: if a model’s context window is too small to accommodate the prompt along with all relevant information, we apply a filter. We retrieve only the information deemed relevant (based on the prompt), augment the prompt by adding this selected information, and then generate a response.

Selecting and retrieving relevant information can be arbitrarily complex and involve varying criteria and technologies. You may decide to only include data that is the most recent or seems most relevant.

However, a key advantage of LLMs is their ability to understand unstructured queries, such as user input. In this case, vector databases are often used to find other unstructured information that is semantically close to that given in a query.

What are Embeddings, Vectorization and Vector Databases?

How do we determine if two pieces of information are similar to each other in terms of their meaning? How do we know, for example, that two text fragments talk about the same thing even if they use different words to do so? Or that two images both include a dog, even if one is a photo and the other a comic drawing? Or that a short audio sample recorded in a noisy bar is recognized as a known song (think of Shazam)? These kinds of problems are commonly solved by a technology called vectorization.

Vector embeddings in a vector space

Imagine a high-dimensional world in which every concept or idea we can express has its place. And more than that: on a map of this world, things that we deem to belong together will be close together, and things that we consider something else entirely will be further apart.

For example, in one corner of this world we will find the concept of a dog and somewhere nearby we might find the concept of a puppy, a cat or a fish. Neutron stars, pancakes and Meryl Streep will also be on this map, but somewhere else entirely - and likely not very close to each other.

This world and its map are usually created using machine learning. A vectorization algorithm takes any piece of data - text, image, audio,… - and returns a very long set of numbers (a vector or embedding) that represents the coordinates of that thing on our map. This way for example, if we have a piece of text - like a prompt - it can be vectorized. The resulting coordinate vector can then be used to find other vectors nearby.

Vector databases are used to store and index vectorized data. A common use case is to provide a vector and return vectors that are nearest neighbors - and therefore represent something with a similar meaning. In a real world example, we might take all the factual information an AI might need, cut it into pieces, vectorize it and store it in a vector database. An incoming prompt will also be vectorized and used to search for nearby vectors, representing information that might be relevant.

For your first attempts, it might not be necessary to set up a new, dedicated vector database. Many databases like PostgreSQL, Redis or ElasticSearch have vectorization plugins and many machine learning frameworks like LangChain come with in-memory vector stores that should be sufficient for simpler use cases.
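As a small sketch of what this looks like in code, the following uses LangChain’s FAISS in-memory vector store (which requires the faiss package) together with embeddings produced by a local Ollama model; the embedding model and the example texts are arbitrary choices here:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Vectorize a few text snippets and index them in an in-memory FAISS store.
texts = [
    "The wifi in room 7 is not working.",
    "Please install a projector in lecture hall 2.",
    "The heating in the library is broken.",
]
vector_store = FAISS.from_texts(texts, OllamaEmbeddings(model="mistral"))

# Find the snippets whose meaning is closest to the query.
for doc in vector_store.similarity_search("network problems", k=2):
    print(doc.page_content)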

Do I Have To Fine Tune My Model?

The answer to this question is most likely no. Training and fine-tuning models is an expensive and complex task. There is also a real risk of causing unintended behavior that might not become apparent until it’s too late. For most use cases it is probably enough to choose the right pre-trained model and provide specific information and context via RAG or directly in the context window.

Example: How To Use AI To Analyze And Summarize Customer Feedback

Let’s take a more concrete example. We want to build an AI tool that will help customer support by summarizing customer feedback emails. This could be implemented using many technologies, but because it is easy and a standard solution in AI and data, we choose Python with LangChain and Ollama.

1) Summarize Feedback Mails Into A Structured List Of Tasks

As preparation, Ollama must be installed and a suitable model downloaded. In this example, we use mistral: ollama pull mistral. Then we set up a simple Python script where we choose the model, build a prompt template and combine both with an output parser to create and invoke our LLM chain:

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = Ollama(model="mistral")
prompt = ChatPromptTemplate.from_messages([(
    "system",
    "You are an assistant and your task is to read and summarize emails containing feedback by people using our facilities.\
    For every technical issue, describe the issue and provide a list of all the rooms where service is required.\
    Add every equipment request to a list with entries formatted '<equipment>: <location> (<time>)'.\
    Please do not include anything that wasn't in the emails. \
    Here are the most recent emails: {data}",
)])
output_parser = StrOutputParser()
# create a pipeline: prompt -> model -> output parser
chain = prompt | llm | output_parser
# joined_emails is a plain string containing the emails (see below)
print(chain.invoke({"data": joined_emails}))

A classic LLM takes a simple input prompt and produces an output. However, some modern models, such as GPT-4, have been trained as chat models, which means that they expect a series of messages as input. If we were to accept user input, that input would usually not go directly into the model. Instead, it would be embedded in a prompt template with some additional information.

In this example, this is done via the ChatPromptTemplate, which accepts a number of messages formatted as (role, message) tuples. System messages are used to provide context and define the overall behavior of the model. For our use case, we only need one system message. The string provided as the message behaves like a formatted Python string. The {data} attribute is provided when the chain is invoked.

LangChain provides more detailed information about the concepts behind prompt templates and messages.
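For comparison, a chat prompt that also accepts user input might look like the following sketch; the “human” role name and the {question} placeholder follow LangChain’s conventions:

from langchain_core.prompts import ChatPromptTemplate

# A template with a fixed system message and a placeholder for the user's input.
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant answering questions about customer feedback."),
    ("human", "{question}"),
])
print(chat_prompt.invoke({"question": "Which rooms reported wifi problems?"}))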

What exactly we pass as the data parameter - joined_emails - depends on the model and the data itself and should be experimented with. Some small or medium-sized models understand JSON well, others XML and others just plain natural language. In this case, joined_emails is just a string containing one mail per line with headers for subject and body:

# Email dataset from https://figshare.com/articles/dataset/Email_Dataset_by_Department/5765376
#
# joined_emails looks like this:
# "
# Subject: WIFI not working
# Body: Dear Sir/Ma'am, The wifi in my room is not working. I live in Room 7 of SH-3. Please help me with this. Thanks, XYZ
# Subject: WiFi Inaccessible
# Body: Dear All, I can't access the wifi through my phone. Please help. With best regards, Adit
# Subject: Wifi not working
# Body: Dear Sir, I can't access the wifi from my laptop. Please help me fix this. Best, Himanshu
# Subject: Wifi broken
# ...
# "
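How joined_emails is built from the CSV file is not shown above. A minimal sketch might look like this, where the column names 'Subject' and 'Body' are assumptions about the dataset and we keep only the first 15 entries, which are used as the most recent mails below:

import csv

# Read the email dataset and join subject and body of each mail into one string.
with open("./example_data/email-dataset-unclassified.csv", newline="") as f:
    rows = list(csv.DictReader(f))

joined_emails = "\n".join(
    f"Subject: {row['Subject']}\nBody: {row['Body']}" for row in rows[:15]
)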

The sample dataset contains over 90 emails. We start by using only the top 15 as the most recent ones, and the model produces the following output:

1. WIFI ISSUES:
   - Room 7, SH-3: WIFI not working (XYZ)
   - Multiple locations (Adit, Himanshu, Deb, Jai): Unable to access WIFI on devices

2. EQUIPMENT REQUESTS:
   - Room no. 2 in the new academic block: Clicker for Prof. Kothari's lecture (Vijay) (5 PM)
   - AC-100: Projector remote for talk (Ayush) (2 PM)
   - LR-106: Projector remote for lecture (Mohit) (10 PM)
   - TR-106: Presentation clicker required (Rohit) (7 PM)
   - AC-109: Remote controller for projector (Hari) (11 AM)
   - Room no. 25: Projector repair (Tom) (No specific time mentioned)
   - TR-106: Speakers repair (ABC) (No specific time mentioned)
   - LR-205: Microphone required (Jeet) (10 PM)
   - Room no. 230: Replacement mic (Alvin) (No specific time mentioned)
   - LT-106: Speakers repair (Dot) (No specific time mentioned)

This is quite amazing! Although the AI doesn’t strictly follow our example template, it gives us a structured list of all the issues and where and when they need to be fixed. Moreover, it included every issue and did not make anything up! Note: Names like “XYZ” and “ABC” are actually part of the sample dataset. Please be aware that although the model works quite reliably in this example, such applications are always susceptible to unpredictable behavior. Even when thoroughly tested, the results must be taken with caution!

2) Ask Questions About Feedback Mails Using RAG

Let’s say there are too many feedback emails to fit them all into our model’s context window. To enable the AI to answer questions about the content of these mails, we need to make sure it has access to the ones that might be relevant to answer the question. As explained above, we can do this using RAG.

First, all mails are vectorized and stored in a vector database. For simplicity, we use an in-memory vector store instead of a vector database.

from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

loader = CSVLoader(file_path="./example_data/email-dataset-unclassified.csv")
documents = loader.load()
vector_store = FAISS.from_documents(documents, OpenAIEmbeddings())

To demonstrate, here we use OpenAI’s API to convert our feedback emails into vectorized embeddings, which are then stored in the vector store. Although OpenAI’s embedding models are very powerful, this could also be done with a local LLM like llama3 or mistral, which also keeps the data on your own machines. In addition, there are dedicated embedding models that can be used specifically for this task.
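If you want to keep everything local, swapping in a different embedding class is a one-line change. As a hedged sketch, a dedicated sentence-transformers model could be used via LangChain’s HuggingFaceEmbeddings; the model name below is just a commonly used default:

from langchain_community.embeddings import HuggingFaceEmbeddings

# Local, embedding-only model instead of the OpenAI API (requires the sentence-transformers package).
local_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(documents, local_embeddings)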

In this example, the CSVLoader creates one document per line in the CSV file, so in effect one document per email. Since these emails are quite short and tend to cover a single topic, there is no need to split them into smaller documents. Longer texts such as websites, books, or multi-page documents may need to be split before being vectorized. The performance of a RAG system depends critically on the underlying model used to create the embeddings and how the documents are split.
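For such longer texts, LangChain ships text splitters that cut documents into overlapping chunks before vectorization. A minimal sketch, where the chunk sizes are arbitrary values worth tuning and the import path may differ between LangChain versions:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split long documents into overlapping chunks before creating embeddings.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())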

Because every document must be run through the embedding model, vectorization can take a while.

Now, when a user prompts the model with a specific question, we can use that prompt as a query to the vector store to retrieve related emails and provide them as context for our actual LLM prompt:

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

user_query = "Which rooms are missing equipment?"

llm = Ollama(model="mistral")
prompt = ChatPromptTemplate.from_template("""
You are an assistant helping with customer feedback.
Answer the following question based only on the emails in the provided context:

<context>{context}</context>

Question: {input}""")
chain = prompt | llm

# use the user query to load related documents and concatenate them into a context string
related_documents = vector_store.as_retriever(search_kwargs={"k": 12}).invoke(user_query)
context = "\n".join(d.page_content for d in related_documents)
res = chain.invoke({"input": user_query, "context": context})

print(res)

For this showcase, we load documents manually and concatenate them into a context string that is then used in the prompt. LangChain also has ways to do this more elegantly by combining these operations into a chain.
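As a sketch of what such a combined chain could look like with the helper functions available in recent LangChain versions (names and import paths may differ between releases, so check the current documentation):

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# The documents chain stuffs the retrieved emails into the {context} variable of the prompt,
# the retrieval chain runs the retriever on the user input first.
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(vector_store.as_retriever(search_kwargs={"k": 12}), document_chain)

result = retrieval_chain.invoke({"input": user_query})
print(result["answer"])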

The parameter k=12 instructs the vector store to always return 12 results. The number of results and their ranking is also crucial for retrieving the right documents for the query context.

In a scenario where texts have been split into smaller documents prior to vectorization, an extra step may be required to fetch the most relevant original documents based on the snippets returned by the vector store.

Challenges in Generative AI Development And What To Do Next

There are a few more things to consider.

It should be noted that running a local model yourself, on an in-house machine or in the cloud, can be challenging. Although these models have seen remarkable improvements in performance, they require a lot of VRAM and GPU power to process data quickly, and suitable setups are still hard to come by, even with cloud providers. While it may be feasible to run occasional Generative AI jobs, such as analyzing unstructured data, on your own infrastructure, an interactive chatbot that has to respond quickly and scale with its users is a different matter.

Even if you plan to run a local model in production, it may be a good idea to use an OpenAI model for development. Its API will cost you pennies, and its performance, capabilities, and reliability will allow you to focus on your business logic.

Be aware that while LangChain provides a unified API for all models, they can work very differently under the hood. Not only do their capabilities and training differ, but also the format in which they best understand input and context data can vary considerably.

Although the example code here looks fairly simple, this is just where the real work begins.

All in all, thanks to the work of the open source community, the first steps with AI and LLMs are quite easy today. A few lines of Python code are enough to start experimenting and then gradually delve into the complexity of the topic.

Check out the example code on GitHub.