Show HN: Using LLMs and Embeddings to classify application errors (github.com/highlight)
65 points by vadman97 7 months ago | 10 comments
Hi Hacker News! We’re Vadim and Chris from Highlight.io [1]. We do web app monitoring and are working on using LLMs/embeddings to add new functionality to our error monitoring product.

Given that there are a lot of founders/engineers using LLMs in their products, we figured we’d share how we built the new functionality, its impact on our workflows, and how you can try it out.

Our goal was to build two features: (1) tagging errors (e.g. deeming an error as “authentication error” or a “database error”); and (2) grouping similar errors together (e.g. two errors that have a different stacktrace and body, but are semantically not very different).

Each of these relies heavily on comparing text across our application. After some experimentation with the OpenAI embeddings API [3], we went ahead and hosted a private instance of thenlper/gte-large (an open-source, MIT-licensed model), a 1024-dimension embedding model running on an Intel Ice Lake 2 vCPU machine on Hugging Face [4].
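
For illustration, here’s a rough Go sketch of what a request to a hosted embedding endpoint looks like. This is not our production code: the endpoint URL, env var names, and exact response shape are placeholders (Hugging Face Inference Endpoints for embedding models generally accept an {"inputs": ...} payload and return a JSON array of floats, but the shape can vary by deployment).

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    func embed(text string) ([]float32, error) {
        body, _ := json.Marshal(map[string]string{"inputs": text})
        req, err := http.NewRequest("POST", os.Getenv("EMBEDDING_ENDPOINT_URL"), bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+os.Getenv("HF_TOKEN"))
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        // Assumed response shape: a single 1024-dimension vector of floats.
        var vec []float32
        if err := json.NewDecoder(resp.Body).Decode(&vec); err != nil {
            return nil, err
        }
        return vec, nil
    }

    func main() {
        vec, err := embed("Firebase: A network AuthError has occurred")
        if err != nil {
            panic(err)
        }
        fmt.Println(len(vec)) // expect 1024 for gte-large
    }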

Our general approach for classifying/comparing text is as follows. As each set of tokens (i.e., a string) comes in, our backend makes a request to an inference endpoint and receives a 1024-dimension float vector as a response (see the code here [5]). We then store that vector using pgvector [6]. To compare any two sets of tokens for similarity, we simply look at the Euclidean distance between their respective embeddings using the ivfflat index implemented by pgvector (example code here [7]).
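
As a rough illustration (the table and column names here are hypothetical, not our actual schema), the store-and-compare step boils down to a pgvector column with an ivfflat index and a nearest-neighbor query on the L2 distance operator:

    // Schema, run once:
    //   CREATE EXTENSION IF NOT EXISTS vector;
    //   CREATE TABLE error_embeddings (id bigserial PRIMARY KEY, embedding vector(1024));
    //   CREATE INDEX ON error_embeddings USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
    package errorstore

    import (
        "database/sql"
        "fmt"
        "strings"

        _ "github.com/lib/pq" // Postgres driver
    )

    // toVectorLiteral renders a float slice in pgvector's '[x,y,...]' input format.
    func toVectorLiteral(v []float32) string {
        parts := make([]string, len(v))
        for i, f := range v {
            parts[i] = fmt.Sprintf("%g", f)
        }
        return "[" + strings.Join(parts, ",") + "]"
    }

    // NearestErrors returns the ids of the k stored errors closest to the query
    // embedding by Euclidean distance (the <-> operator, served by the ivfflat index).
    func NearestErrors(db *sql.DB, query []float32, k int) ([]int64, error) {
        rows, err := db.Query(
            `SELECT id FROM error_embeddings ORDER BY embedding <-> $1::vector LIMIT $2`,
            toVectorLiteral(query), k)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var ids []int64
        for rows.Next() {
            var id int64
            if err := rows.Scan(&id); err != nil {
                return nil, err
            }
            ids = append(ids, id)
        }
        return ids, rows.Err()
    }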

To tag errors, we assign each error the most relevant tag from a predetermined set decided by us. For example, tagging an error as an "authentication error" or a "database error" gives developers a starting point before inspecting an issue (see the logic here [8]).

Anecdotally, this approach seems to work very well. For example, here are two authentication errors that got tagged as “Authentication Error”:

    * Firebase: A network AuthError has occurred
    * Error retrieving user from firebase api for email verification: cannot find user from uid.
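
For illustration, the tag-assignment step is essentially a nearest-neighbor lookup over a handful of pre-embedded tags. Here’s a hedged Go sketch (the tag names and toy 2-d vectors are made up for the example; in practice the vectors are 1024-dimensional and come from the same embedding model as the errors):

    package main

    import (
        "fmt"
        "math"
    )

    // l2 returns the Euclidean distance between two equal-length vectors.
    func l2(a, b []float32) float64 {
        var sum float64
        for i := range a {
            d := float64(a[i] - b[i])
            sum += d * d
        }
        return math.Sqrt(sum)
    }

    // nearestTag picks the predetermined tag whose embedding is closest to the
    // error's embedding.
    func nearestTag(errorVec []float32, tagEmbeddings map[string][]float32) string {
        best, bestDist := "", math.Inf(1)
        for tag, vec := range tagEmbeddings {
            if d := l2(errorVec, vec); d < bestDist {
                best, bestDist = tag, d
            }
        }
        return best
    }

    func main() {
        // Toy 2-d embeddings standing in for real 1024-d ones.
        tags := map[string][]float32{
            "Authentication Error": {0.1, 0.9},
            "Database Error":       {0.8, 0.2},
        }
        fmt.Println(nearestTag([]float32{0.2, 0.8}, tags)) // Authentication Error
    }
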
We also use these error embeddings to group similar errors. To decide whether an error joins a group or starts a new one, we decide on a Euclidean distance threshold ahead of time. An interesting thing about this approach, compared to using a text-based heuristic, is that two errors with different stack traces can still be grouped together. Here’s an example:

    * github.com/highlight-run/highlight/backend/worker.(*Worker).ReportStripeUsage
    * github.com/highlight-run/highlight/backend/private-graph/graph.(*Resolver).GetSlackChannelsFromSlack.func1
Both were tagged as `integration api error`, as they involve the Stripe and Slack integrations respectively. The neat thing is that the LLM can use the full context of an error and match on its most relevant details.
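
To make the grouping step concrete, here’s a hedged Go sketch of the threshold check (the table name, column names, and threshold value are placeholders, not our production code): find the closest existing group by Euclidean distance; join it if the distance is under the threshold, otherwise start a new group.

    package grouping

    import "database/sql"

    const groupDistanceThreshold = 0.25 // placeholder value; tuned empirically

    // AssignGroup returns the id of the group an error should join, creating a
    // new group when no existing group is within the distance threshold.
    // errorVec is the embedding rendered in pgvector's '[x,y,...]' literal form.
    func AssignGroup(db *sql.DB, errorVec string) (int64, error) {
        var groupID int64
        var dist float64
        err := db.QueryRow(
            `SELECT id, embedding <-> $1::vector
               FROM error_groups
              ORDER BY embedding <-> $1::vector
              LIMIT 1`, errorVec).Scan(&groupID, &dist)
        if err != nil && err != sql.ErrNoRows {
            return 0, err
        }

        if err == sql.ErrNoRows || dist > groupDistanceThreshold {
            // No existing group is close enough: start a new one seeded with
            // this error's embedding.
            err = db.QueryRow(
                `INSERT INTO error_groups (embedding) VALUES ($1::vector) RETURNING id`,
                errorVec).Scan(&groupID)
            return groupID, err
        }
        return groupID, nil
    }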

We have rolled out a first version of the error grouping logic to our cloud product [9], and there’s a demo of all the functionality at [2]. Long-term, if the HN community has other ideas of what we could build with LLM tooling in observability, we’re all ears. Let us know what you think!

Links

[1] https://news.ycombinator.com/item?id=36774611

[2] https://app.highlight.io/error-tags

[3] https://platform.openai.com/docs/guides/embeddings

[4] https://huggingface.co/thenlper/gte-large

[5] https://github.com/highlight/highlight/blob/main/backend/emb...

[6] https://github.com/highlight/highlight/blob/main/backend/mod...

[7] https://github.com/highlight/highlight/blob/main/backend/pub...

[8] https://github.com/highlight/highlight/blob/main/backend/pri...

[9] https://app.highlight.io




This is awesome!!

some questions:

1. Have you looked into finetuning the embedding model to your use case?

2. Have you considered faster foundation models?

3. How far could you go with this idea? Could this be the basis of a new monitoring platform?


> Have you looked into finetuning the embedding model to your use case?

Not yet, though this is definitely one of the next steps for us. The `gte-large` model we use is trained on a variety of text, but our hypothesis is that one trained or fine-tuned on technical / code-related content may work better.

> Have you considered faster foundation models?

Any in particular that you would suggest? We're still pretty new to this so would love to learn about other recommendations. Would a foundation model perform as well at this task?

> How far could you go with this idea? Could this be the basis of a new monitoring platform?

Certainly; there are traditional ML approaches that could be applied to monitoring as well, and we're heavily exploring this (e.g., metric anomaly detection). Another area where we're exploring embedding-based grouping is filtering ingest: helping folks only ingest / retain the data they actually want, but without the overhead of strict filter rules. Tons more to explore in this space, and you will certainly be hearing more from us here.


Very cool! I am curious about a few aspects:

How do you handle the scenario where errors may exhibit evolving characteristics over time, which might potentially impact the effectiveness of the current embedding model? Is there a mechanism for ongoing model adaptation to ensure sustained accuracy and relevance?

Lastly, could you envision a scenario where this technology is extended to not only classify, but also to predict potential errors based on historical and real-time data?


Regarding model adaptation, we haven't yet explored a fine-tuned model, but it makes a lot of sense for a given class of errors. For a given code base using Highlight, the errors will typically come from a given language / infrastructure, so fine-tuning the model to those errors should be beneficial.

As for predicting potential errors, this will particularly make sense as an anomaly detection mechanism across metrics and logs. A class of 'important' errors based on the LLM's understanding of the error, as well as historical comparison to normal error rates, is something we're exploring with an 'interesting errors' concept - stay tuned for more there!


This is really cool! We'd love to collaborate; if interested plz drop an email :) sam@elide.dev


how does this work economically? we’re at pretty small scale but we process millions of error events every day

does this get expensive?


Our errors ingest is quite affordable (https://www.highlight.io/pricing) and lets you filter out data you are not interested in ingesting. Our logs ingest is even cheaper because of the underlying ClickHouse storage we use.


why do you have so many errors at a "small scale"? Does every request error multiple times?


Presumably, they mean millions of logs a day when saying "millions of events a day", of which some subset corresponds to errors.


sorry i was very unclear - we also offer an issue type feature in our product: https://docs.sst.dev/console#issues

and these are our customers' errors - i was wondering whether doing this hurts your unit costs too much

that said i played with cloudflare's new service since i asked the question and it seems viable



