LangChain

LangChain is a bridge between developers and large language models. It is made up of several packages: langchain-core, which provides the base abstractions and the LangChain Expression Language; langchain-community, which provides third-party integrations; and langchain itself, which provides the chains, agents, and retrieval strategies that make up an application's cognitive architecture.

Question Answering

import textwrap
from pathlib import Path

import bs4
from langchain import hub
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chains import LLMChain
from langchain.document_loaders import PyPDFLoader, WebBaseLoader, YoutubeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.llms import CTransformers, LlamaCpp
from langchain_community.vectorstores import FAISS, Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnablePick
from langchain_openai import ChatOpenAI, OpenAI, OpenAIEmbeddings

ROOT = Path.cwd().parent.parent


def print_wrapped(text: str, width: int = 80):
    print(textwrap.fill(text, width))

Basic Usage

def generate_pet_name(animal_type, pet_colour):
    llm = OpenAI(temperature=0.7)
    prompt_template_name = PromptTemplate(
        input_variables=["animal_type", "pet_colour"],
        template=(
            "I have a pet {animal_type} and I want a cool name for it, it is {pet_colour} in colour. "
            "Suggest 5 cool names for my pet"
        ),
    )
    name_chain = LLMChain(llm=llm, prompt=prompt_template_name, output_key="animal_name")
    response = name_chain.invoke({"animal_type": animal_type, "pet_colour": pet_colour})
    return response


pet_name_response = generate_pet_name("dog", "brown")
print(pet_name_response["animal_name"].strip())
1. Copper
2. Bruno
3. Hazel
4. Rusty
5. Chestnut

Agents

llm = OpenAI(temperature=0.5)
tools = load_tools(["wikipedia", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
print(agent.agent.llm_chain.prompt.template)
Answer the following questions as best you can. You have access to the following tools:

Wikipedia: A wrapper around Wikipedia. Useful for when you need to answer general questions about people, places, companies, facts, historical events, or other subjects. Input should be a search query.
Calculator: Useful for when you need to answer questions about math.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Wikipedia, Calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
result = agent.invoke(
    """
    What is the average age of a dog? 
    Look it up if you don't know it. 
    The answer should be an integer. 
    Multiply the age by 3
    """
)


> Entering new AgentExecutor chain...
 I should use the Calculator tool to calculate the average age of a dog
Action: Calculator
Action Input: (15 + 9 + 12 + 18 + 5) / 5
Observation: Answer: 11.8
Thought: I should multiply the age by 3 to get the answer in dog years
Action: Calculator
Action Input: 11.8 * 3
Observation: Answer: 35.400000000000006
Thought: I now know the final answer
Final Answer: The average age of a dog is approximately 35 years in dog years.

> Finished chain.
print(result["output"])
The average age of a dog is approximately 35 years in dog years.

Vector DBs

def create_vector_db_from_youtube_url(video_url: str) -> FAISS:
    loader = YoutubeLoader.from_youtube_url(video_url)
    transcript = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = text_splitter.split_documents(transcript)

    embeddings = OpenAIEmbeddings()

    db = FAISS.from_documents(docs, embeddings)

    return db


def get_response_from_query(db, query, k=4):
    docs = db.similarity_search(query, k=k)
    docs_page_content = " ".join([doc.page_content for doc in docs])

    llm = OpenAI()

    prompt = PromptTemplate(
        input_variables=["question", "docs"],
        template="""
        You are a helpful assistant that can answer questions about YouTube videos
        based on the video's transcript.
        
        Answer the following question: {question}
        By searching the following video transcript: {docs}
        
        Only use the factual information from the transcript to answer the question.
        
        If you feel like you don't have enough information to answer the question, say "I don't know".
        
        Your answers should be verbose and detailed.
        """,
    )

    chain = LLMChain(llm=llm, prompt=prompt)

    response = chain.invoke({"question": query, "docs": docs_page_content})
    answer = response["text"]

    return answer, docs
youtube_url = "https://youtu.be/VMj-3S1tku0?si=ei_FTn8tKzZVd0y0"
youtube_query = "What is a prompt template?"

db = create_vector_db_from_youtube_url(youtube_url)
response, docs = get_response_from_query(db, youtube_query)
response_lines = response.split("\n\n")
for line in response_lines:
    print_wrapped(line.strip())
    print()
A prompt template refers to a standardized format or structure for a prompt,
which is used to provide instructions or indicate what is expected for a
specific task or activity. In the context of the video transcript, the speaker
is discussing the use of neural networks and how they can be trained to perform
various tasks. The prompt template is an important aspect of this process, as it
provides a clear and consistent structure for the training data.

The speaker explains that neural networks are made up of a large number of
parameters, or "neurons", which work together to solve complex problems. These
neurons are organized in a structure that simulates neural tissue, and can be
trained using data from the internet. In order to effectively train a neural
network, the data must be presented in a standardized format, which is where the
prompt template comes in.

The prompt template is used to provide a consistent structure for the data,
which allows the neural network to learn and make predictions based on patterns
within the data. This is important because it allows the network to make
connections and recognize patterns across different datasets, which ultimately
leads to more accurate predictions.

The prompt template also plays a role in the training process by providing a
baseline for the network to compare against. The speaker explains that when
training a neural network, a small
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
)
docs = loader.load()

Our loaded document is over 42k characters long. This is too long to fit in the context window of many models. Even for those models that could fit the full post in their context window, models can struggle to find information in very long inputs.
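
We can confirm this with a quick check (the exact count may change if the page is edited):

print(len(docs))                  # the loader returns a single Document for this page
print(len(docs[0].page_content))  # a little over 42,000 characters at the time of writing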

To handle this we’ll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.

In this case we’ll split our documents into chunks of 500 characters with no overlap between chunks. (Overlap between chunks helps mitigate the possibility of separating a statement from important context related to it.) We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We could also set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as the metadata attribute “start_index”.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
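
To get a sense of what the splitter produced, we can inspect the number of chunks and peek at one of them (the counts depend on the chunk_size and chunk_overlap chosen above):

print(len(splits))                   # number of chunks produced from the blog post
print(splits[0].page_content[:200])  # first 200 characters of the first chunk
print(splits[0].metadata)            # each chunk keeps the source URL in its metadata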

Now we need to index our text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits whose embeddings are most similar to our query embedding. The simplest similarity measure is cosine similarity: we measure the cosine of the angle between each pair of embeddings (which are high-dimensional vectors).
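
As a toy illustration of cosine similarity (the vectors below are made-up three-dimensional values, not real embeddings, which typically have hundreds or thousands of dimensions):

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query_embedding = np.array([0.1, 0.7, 0.2])
chunk_embedding = np.array([0.2, 0.6, 0.1])
print(cosine_similarity(query_embedding, chunk_embedding))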

We can embed and store all of our document splits in a single command using the Chroma vector store and the GPT4AllEmbeddings model.

vectorstore = Chroma.from_documents(documents=splits, embedding=GPT4AllEmbeddings())
bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522

We need to define our logic for searching over documents. LangChain defines a Retriever interface which wraps an index that can return relevant Documents given a string query.

The most common type of Retriever is the VectorStoreRetriever, which uses the similarity search capabilities of a vector store to facilitate retrieval.

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
retrieved_docs
[Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='(3) Task execution: Expert models execute on the specific tasks and log results.\nInstruction:', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='judge the correctness of task results.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='Planning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemory', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='(2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed.\nInstruction:', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'})]
prompt = hub.pull("rlm/rag-prompt-mistral")
prompt.template
"<s> [INST] You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. [/INST] </s> \n[INST] Question: {question} \nContext: {context} \nAnswer: [/INST]"
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF", filename="mistral-7b-instruct-v0.1.Q3_K_S.gguf"
)
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    verbose=True,
)
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/henrydashwood/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.1-GGUF/snapshots/731a9fc8f06f5f5e2db8a0cf9d256197eb6e05d1/mistral-7b-instruct-v0.1.Q3_K_S.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 11
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 2.95 GiB (3.50 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3017.97 MiB, ( 6453.66 / 12288.02)
llm_load_tensors: system memory used  = 3017.38 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: ggml.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/henrydashwood/.pyenv/versions/3.11.6/envs/py3116/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 12884.92 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   256.00 MiB, ( 6709.66 / 12288.02)
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 6709.67 / 12288.02)
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   156.02 MiB, ( 6865.67 / 12288.02)
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
ggml_metal_free: deallocating
llm.invoke("Simulate a rap battle between Stephen Colbert and John Oliver")

llama_print_timings:        load time =    4476.29 ms
llama_print_timings:      sample time =      91.55 ms /   256 runs   (    0.36 ms per token,  2796.41 tokens per second)
llama_print_timings: prompt eval time =    4476.25 ms /    13 tokens (  344.33 ms per token,     2.90 tokens per second)
llama_print_timings:        eval time =   11734.44 ms /   255 runs   (   46.02 ms per token,    21.73 tokens per second)
llama_print_timings:       total time =   17380.26 ms
".\n\n[INTRODUCTION]\n\nStephen Colbert: (Entering the stage, microphone in hand) Ladies and gentlemen, boys and girls, welcome back to The Late Show! Tonight, we have a very special guest. He's an incredibly talented comedian who hosts one of the most brilliant satirical news shows on television. Please give it up for my friend, John Oliver!\n\n[AUDIENCE APPLAUSE]\n\nJohn Oliver: (Walking onto the stage with his signature deadpan expression) Thank you, Stephen. It's great to be here. I must say, your audience is quite... passionate.\n\nStephen Colbert: Well, they are indeed! But enough about me. Let's get down to business. You know what we do here at The Late Show - we engage in friendly rap battles, pitting two of the wittiest comedians against each other in a battle of rhymes and wit. Are you ready for this, John?\n\nJohn Oliver: (Pulling out a notepad) Alright, let's do this!\n\n[BATTLE BEGINS]\n\nStephen Colbert"
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# llm = CTransformers(
#     **{
#         "model": "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
#         "model_file": "mistral-7b-instruct-v0.1.Q4_K_M.gguf",
#     }
# )

We’ll use the LangChain Expression Language (LCEL) Runnable protocol to define the chain. This lets us pipe together components and functions in a transparent way, automatically trace our chain in LangSmith, and get streaming, async, and batched calling out of the box.

rag_prompt = hub.pull("rlm/rag-prompt-mistral")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs) | rag_prompt | llm | StrOutputParser()
)
question = "What are the approaches to Task Decomposition?"
chain.invoke({"context": docs, "question": question})
Llama.generate: prefix-match hit

llama_print_timings:        load time =    4476.29 ms
llama_print_timings:      sample time =      31.65 ms /   103 runs   (    0.31 ms per token,  3254.55 tokens per second)
llama_print_timings: prompt eval time =    1054.67 ms /   258 tokens (    4.09 ms per token,   244.63 tokens per second)
llama_print_timings:        eval time =    4707.73 ms /   102 runs   (   46.15 ms per token,    21.67 tokens per second)
llama_print_timings:       total time =    6180.33 ms
' The approaches to task decomposition are (1) using simple prompting by LLM, (2) providing task-specific instructions for humans or LLMs to follow, and (3) utilizing expert models that execute specific tasks and log results. The challenges in long-term planning and task decomposition include adjusting plans in response to unexpected errors, making LLMs less robust than humans who learn from trial and error. Judging the correctness of task results involves evaluating the accuracy and completeness of the output.'
retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()} | rag_prompt | llm | StrOutputParser()
)
question = "What are the approaches to Task Decomposition?"
qa_chain.invoke(question)
Llama.generate: prefix-match hit

llama_print_timings:        load time =    4476.29 ms
llama_print_timings:      sample time =      14.65 ms /    74 runs   (    0.20 ms per token,  5050.51 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    3464.94 ms /    74 runs   (   46.82 ms per token,    21.36 tokens per second)
llama_print_timings:       total time =    3674.59 ms
' There are three approaches to task decomposition: LLM with simple prompting, using task-specific instructions, or with human inputs. Long-term planning and task decomposition can be challenging, especially when exploring solution space and adjusting plans with unexpected errors. Task execution involves expert models executing specific tasks and logging results, which can then be judged for correctness.'