On 20 July 2023, Max Lorenz, a tech lead at an AI startup and a co-organizer of FOSSG, delivered the keynote at the Midsummer Open Source Night, sharing his thoughts on AI and open source technologies. Here's what he said.
Max originally hails from Germany and has been in Singapore since 2018. Now happily married and an ardent supporter of open source, he works at an AI startup specializing in LLMs (Large Language Models) and is part of the organizer team of FOSSG.
Full text of the speech
I'm Max. I've been working in Singapore for five years on AI products, and some time ago I thought: why don't I just start my own startup right now? It's the perfect time. There seems to be so much more hype around AI products compared to five years ago. Now you can actually pitch them to companies, and people are interested to hear about it. So naturally I started with the fancy part, which is easier to sell: the demo.
I want to tell you what happened after I did the demo and decided to turn it into a product, because I think this is a topic that not everyone touches on. If it works in LangChain with one or two documents, it's great! But once you put it into production, it doesn't always work exactly like the magic you expected. So here is the first thing I noticed.
I basically followed some tutorials. I had been using GPT-3 for one and a half years before, so I kind of knew what I was getting into when I pitched my ideas. I'd say: see, this can write emails for you, you can upload your PDF — everything we've seen in the first talk. But once I uploaded a hundred PDFs, semantic search did not actually perform the way I expected, and the people using it were very confused. They said: the demo looked so good.
Why is it not working with my data? Is there something wrong? After all, we had shown the customer what it could do.
And the second part: there were so many edge cases that no matter how intricate my GPT-4 prompt was, in the end it was never up to the task. There were always edge cases where it didn't work. So it was clear: maybe one prompt is not enough, maybe one search is not enough, especially with LangChain.
My next issue was that it was very slow. To this day I use LangChain all the time, but mostly for demo purposes. If I put something into production, where you really care about how fast and how accurately the product works, I usually start writing things from scratch.
So the first issue we encountered as a startup was: we have all this data, how do we store it? Do we just use the typical vector database that everyone seems to be raving about these days? They certainly have great technology, but we needed a more sophisticated approach with a lot more control. The first thing we tried: maybe we can fine-tune GPT. Of course, that's not exactly how it works, because you have so many new documents coming in all the time, and even fine-tuning doesn't retain all the information. So we need to store the data. There are so many databases to choose from, and what we figured out very quickly is that companies don't care about documents in isolation. They don't want to find a sentence or a snippet in a document all the time, except for a few select use cases. What companies usually care about is: we have assets.
We have companies that we talk to, we have competitors, and they want to know that your product can handle those to the best of its abilities. So you need to find a way to store them separately, search them separately, and handle them differently.
So how do we store the data, now that we know this? We probably have more than one database, as I'll show you later, but we're using Postgres, where you can combine a traditional relational database with vector search.
How do we feed in the data? Many companies have thousands of PDF files, Word documents, Excel sheets, and images (shout-out to the first talk on LLMs). They have so many images with descriptions and titles. What do people really care about when they search: should the title match, should the text match, should the image match, or a mix of all? So first of all you need to figure out which is the best embedding model you can use for all your different modalities.
How can you use traditional search engines, plain BM25? And how do you extract the metadata from your data, from your documents — say, company names if you're using CRMs? Long story short, we've just been using Postgres for now, which has an amazing open source community behind it, with different plugins as well.
pgvector is great if you want to do vector search on embeddings, and keyword search works well too. If you're looking for embedding models, there are a couple of modern papers — E5 from Microsoft, for example, is really, really good. If you're still using GPT for your embeddings, the Ada embeddings, E5 outperformed them by a huge margin for our use cases. And if you have enough data — thousands of documents and above — fine-tuning the embedding model actually yields huge returns. Semantic search tends to fall short when people know exactly what they're looking for and use Google-style queries: "I just want this one piece of information." Then semantic search is not ideal, which is why we also combine it with traditional keyword search. BM25 performs really well and is easy to improve with recency and source weighting, since most people care more about new information.
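As a rough illustration of this hybrid approach, here is a minimal sketch that blends a crude keyword-overlap score (a stand-in for real BM25) with cosine similarity over embeddings. The documents, the tiny hand-made vectors, and the `alpha` blend weight are all invented for the example; in production the embeddings would come from a model such as E5.

```python
import math
from collections import Counter

# Toy corpus: each document is raw text plus a pre-computed embedding.
# The three-dimensional vectors are made up purely for illustration.
DOCS = [
    {"id": "a", "text": "quarterly revenue report for competitor x",
     "emb": [0.9, 0.1, 0.0]},
    {"id": "b", "text": "office party photos from december",
     "emb": [0.0, 0.2, 0.9]},
    {"id": "c", "text": "revenue forecast and market analysis",
     "emb": [0.8, 0.3, 0.1]},
]

def keyword_score(query: str, text: str) -> float:
    """Crude stand-in for BM25: count query-term overlap."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_search(query: str, query_emb, alpha: float = 0.5):
    """Blend keyword and semantic scores; alpha weights the semantic side."""
    scored = []
    for doc in DOCS:
        kw = keyword_score(query, doc["text"])
        sem = cosine(query_emb, doc["emb"])
        scored.append((alpha * sem + (1 - alpha) * kw, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

print(hybrid_search("revenue report", [0.9, 0.2, 0.0]))
```

The keyword term rewards exact matches for Google-style queries, while the cosine term still surfaces semantically related documents that share no words with the query.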
If you have a huge Confluence space with hundreds of pages, putting a higher weight on more recently edited documents works great. There are also re-ranking models, which are expensive, but basically you take the top hundred search results and use another AI model to re-rank how relevant each snippet is to the user's question — that improves results as well. That covers the search part. But the next question is: now we have all these documents.
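The recency weighting can be sketched as an exponential decay applied on top of a relevance score. The half-life value, the document dates, and the relevance numbers below are all invented for illustration:

```python
from datetime import date

def recency_weight(doc_date: date, today: date,
                   half_life_days: float = 180.0) -> float:
    """Exponential decay: a document half_life_days old counts half as much."""
    age = (today - doc_date).days
    return 0.5 ** (age / half_life_days)

def rerank(results, today: date):
    """results: list of (doc_id, relevance, last_edit_date).
    Sort by relevance discounted by document age."""
    return sorted(
        results,
        key=lambda r: r[1] * recency_weight(r[2], today),
        reverse=True,
    )

today = date(2023, 7, 20)
results = [
    ("old_page", 0.90, date(2021, 7, 20)),  # very relevant, two years stale
    ("new_page", 0.70, date(2023, 6, 20)),  # less relevant, one month old
]
print(rerank(results, today))
```

With a 180-day half-life the two-year-old page keeps only a few percent of its relevance, so the fresher page wins despite a lower raw score.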
Let's say a company has assets and people query for these things. We also use knowledge graphs: we store every person, every entity, every asset in our database with a vector. Once we ingest a document, we store those, and when people ask for something, we give them the ability to scope their search. If they say "I want to know about competitor X", we can limit the documents to those where competitor X is actually mentioned.
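A minimal sketch of this entity-scoped search, assuming each document was stored at ingestion time together with the entities detected in it. The documents, entity names, and keyword matching are all invented stand-ins for the real pipeline:

```python
# Each ingested document carries the set of entities (people, companies,
# assets) detected in it, so a query can be scoped before search runs.
DOCS = [
    {"id": 1, "text": "Competitor X launched a new pricing tier.",
     "entities": {"Competitor X"}},
    {"id": 2, "text": "Our Q2 roadmap review notes.",
     "entities": {"Roadmap"}},
    {"id": 3, "text": "Competitor X hired a new CTO.",
     "entities": {"Competitor X", "CTO"}},
]

def scoped_search(query_terms, scope_entity=None):
    """Restrict to documents mentioning scope_entity, then keyword-match."""
    pool = [d for d in DOCS
            if scope_entity is None or scope_entity in d["entities"]]
    return [d["id"] for d in pool
            if any(t.lower() in d["text"].lower() for t in query_terms)]

print(scoped_search(["pricing"], scope_entity="Competitor X"))
```

The same idea maps directly onto a SQL `WHERE` clause over an entity-link table, so the vector search only ever runs over the scoped subset.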
All right, now the last step. You have amazing search, and based on the search results and some other signals, you can run different prompts.
How do you put it into production?How do you make sure you catch all the error cases?How do you make sure it gets better over time?
First thing: we use a lot of classification. People ask different things — sometimes they just want to chat — and you want to be able to categorize that. If you ask GPT to decide whether a query is A or B, the accuracy is usually not high and it is very slow. So we usually build our own models.
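As a toy stand-in for those small dedicated classifiers, here is a bag-of-words "centroid" router that sends a query to whichever label's training vocabulary it overlaps most. The labels and training examples are invented; a real model would be trained on far more data:

```python
from collections import Counter

# Tiny labeled set; in practice such data could be auto-generated with
# GPT-4 and then used to train a small, fast dedicated model.
TRAIN = [
    ("hi how are you", "chat"),
    ("thanks that was helpful", "chat"),
    ("find the latest contract with acme", "search"),
    ("show me documents about pricing", "search"),
]

def featurize(text: str) -> Counter:
    """Lower-cased bag of words."""
    return Counter(text.lower().split())

# Build one aggregated word bag ("centroid") per label.
centroids: dict[str, Counter] = {}
for text, label in TRAIN:
    centroids.setdefault(label, Counter()).update(featurize(text))

def classify(query: str) -> str:
    """Route a query to the label whose training words it overlaps most."""
    q = featurize(query)
    def overlap(label: str) -> int:
        return sum(q[w] for w in centroids[label])
    return max(centroids, key=overlap)

print(classify("find documents about acme"))
```

A model this small answers in microseconds, which is the point of the talk's argument: routing should not cost an LLM round-trip.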
There are a couple of open source projects that use GPT-4 to generate artificial training data and auto-label it — great open source tools that I can recommend. The next thing: you want to call APIs. If you have a CRM and want to get the latest version of an asset, you should use function-calling models like the ones OpenAI announced a couple of weeks back. If you just ask GPT to return JSON, it sometimes hallucinates fields, whereas the dedicated models are restricted to only output tokens that are valid for that specific JSON call, so this is really easy to integrate with new APIs. And the last thing, which for me is the game changer: monitoring. You always want to record everything that's going on and use typical data science methods. You want to capture: how is it received by the user? Can we get metrics to see how they're using the product? Sometimes just a thumbs-up and thumbs-down button is good enough to capture trends.
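Whichever model generates the JSON, one cheap guard against hallucinated fields is to validate the parsed output against the declared parameter schema before calling the downstream API. The `get_asset`-style schema below is hypothetical, invented for the sketch:

```python
import json

# Hypothetical function schema, in the spirit of function calling:
# the model may only fill these fields, with these types.
GET_ASSET_SCHEMA = {"asset_id": str, "version": int}

def validate_call(raw: str, schema: dict) -> dict:
    """Parse model output; reject hallucinated, missing, or mistyped fields."""
    args = json.loads(raw)
    extra = set(args) - set(schema)
    if extra:
        raise ValueError(f"hallucinated fields: {sorted(extra)}")
    for name, typ in schema.items():
        if name not in args:
            raise ValueError(f"missing field: {name}")
        if not isinstance(args[name], typ):
            raise ValueError(f"bad type for field: {name}")
    return args

print(validate_call('{"asset_id": "A-42", "version": 3}', GET_ASSET_SCHEMA))
```

Rejected calls can be retried with the validation error appended to the prompt, which in practice often fixes the output on the second attempt.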
Weights & Biases has a great integration for LangChain where you can capture every single step: what is the agent doing here, and what were the GPT-4 settings? So if someone changes the GPT-4 settings, you want to record that and see whether it increases or decreases some metric over time. That's it.
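A thumbs-up/thumbs-down button can be turned into a trend metric with nothing more than a rolling window. This is a minimal, invented sketch of that idea:

```python
from collections import deque

class FeedbackTracker:
    """Rolling thumbs-up rate over the last `window` responses."""

    def __init__(self, window: int = 100):
        # deque with maxlen drops the oldest vote automatically.
        self.votes = deque(maxlen=window)

    def record(self, thumbs_up: bool) -> None:
        self.votes.append(1 if thumbs_up else 0)

    def approval_rate(self) -> float:
        return sum(self.votes) / len(self.votes) if self.votes else 0.0

tracker = FeedbackTracker(window=4)
for vote in [True, True, False, True, True]:  # oldest vote rolls out
    tracker.record(vote)
print(tracker.approval_rate())
```

Logging this rate alongside prompt and model-setting changes is enough to see whether a tweak helped or hurt over time.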
Thank you very much!
Scan the code to get the complete PPT
About KCC Singapore
KCC Singapore, founded on July 20, 2023, is the first step in the open source community's global strategy. Our mission is to empower developers to embrace and contribute to open source. Through partnerships with universities, tech companies, and government departments, we aim to promote open source adoption in Singapore's digital economy. Working closely with local open source communities and forging global connections, we amplify the voice of Chinese open source. Together, we empower open source for a brighter digital future.