Can we trust our Copilot summary? Part two - an analysis

In the previous article, I set out a case that Copilot summaries of long pieces of text are not necessarily to be trusted, because they don't actually understand the text they're being asked to summarise - they have no knowledge of what is and isn't important in a piece of text.

In this article I'm going to look at the Copilot summary of that article and see if my assertion stands up to analysis. Gerben Wierda has also done a more extensive comparison of a ChatGPT summary against a human summary.

The article “Can we trust our Copilot summary? Part one - the issues” explores the reliability of AI-generated summaries, specifically those produced by Copilot. Here are the key points:

  1. Initial Assertion:
    • The article starts by referencing Ian Betteridge’s assertion that the answer to the question in the headline is “No.”
    • The author aims to explain how Large Language Models (LLMs) like Copilot work, leading to the conclusion that their summaries cannot be fully trusted.
  2. Understanding Large Language Models (LLMs):
    • LLMs are described as “Probability Engines for Words,” similar to predictive text on phones or Markov Chains.
    • Markov Chains predict word sequences based on probabilities derived from previous data but can generate nonsensical text.
  3. Neural Networks and Training:
    • Unlike simple Markov Chains, LLMs use neural networks and are trained on vast amounts of data to understand context and relationships between words.
    • Training involves assigning words to multiple dimensions, such as parts of speech and semantic meanings, to improve the accuracy of generated text.
  4. Challenges and Limitations:
    • Despite advanced training, LLMs can still produce “hallucinations” or nonsensical content if not meticulously trained.
    • The article highlights the importance of human oversight in ensuring the accuracy and relevance of AI-generated summaries.
  5. Comparison with Human Summaries:
    • The author plans to compare a human-generated summary with a Copilot-generated one in Part Two of the article to evaluate their accuracy.
  6. Importance of Context:
    • The article emphasizes that while AI-generated summaries can be useful for initial drafts, they should not be solely relied upon for important decisions or detailed understanding.

If I were asked to score this, I would - to use an allusion to the series Chernobyl - at first glance score it at 3.6 röntgens per hour, or, putting it another way, 'not great, not terrible'. I think that's probably a common initial reaction to a Copilot summary.

Going through the summary in detail, I think my assertion is supported. First of all, the reference to Betteridge's Law is hardly the most important point to kick the summary off with!

For point two, a useful summary would explain how a Markov Chain, or your phone, predicts the next word from the previous one: each word is chosen according to the probability of it following the previous word in the text the chain was built from. I wouldn't summarise it as "can generate nonsensical text"; I'd say the output is either gibberish or close to the original, depending on the setting.
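To make that mechanism concrete, here is a minimal sketch in Python of the kind of chain I mean. The corpus, function names and choice of a one-word prefix are my own illustration, not anything taken from Copilot or the original article; it simply shows next words being picked in proportion to how often they followed the previous word in the source text.

```python
# A toy Markov chain: count which words follow which, then pick each next
# word weighted by those counts. Purely illustrative.
import random
from collections import defaultdict, Counter

def build_chain(text: str) -> dict[str, Counter]:
    """Count, for every word, which words followed it and how often."""
    words = text.split()
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain

def generate(chain: dict[str, Counter], start: str, length: int = 10) -> str:
    """Pick each next word weighted by its observed follow-on frequency."""
    word, output = start, [start]
    for _ in range(length):
        followers = chain.get(word)
        if not followers:
            break
        candidates, counts = zip(*followers.items())
        word = random.choices(candidates, weights=counts)[0]
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the dog sat on the rug"
print(generate(build_chain(corpus), start="the"))
```

With a prefix this short the output wanders, which is the "gibberish" end of the scale; the longer the prefix used to look up the next word, the closer the output sticks to the original text.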

For points three and four, I don't think the summary is entirely bad; it just misses the point. Stating that a Large Language Model needs to be meticulously trained is rather stating the obvious. What's important to summarise is the difference between a Markov Chain being a brute-force probability engine and an LLM being more nuanced: the extensive multidimensional training process gives the model more information about which words are more or less likely to follow others, based on how those words are usually used in real sentences. And the point about hallucinations is that no matter how meticulously the model is trained, because it doesn't actually know what frogs and dogs are, it doesn't know anything about dogs sitting on frogs; similarly, it's possible to force an LLM to generate a hallucination with a deliberate prompt, because it doesn't know the prompt it has been given is ridiculous.
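As a rough illustration of the "multiple dimensions" idea, here is a Python sketch using invented numbers. Each word gets a small vector whose dimensions loosely stand in for aspects of how the word is used, and candidates are compared by similarity; the words, vectors and dimension labels are all made up for this example and are a stand-in for what a real LLM learns, not how one is actually implemented.

```python
# Made-up word vectors: words that are used in similar ways end up with
# similar vectors, so "dog" and "frog" look interchangeable to the model -
# whether or not the resulting sentence makes real-world sense, which is
# where hallucinations come from.
import math

toy_vectors = {
    # dimensions (invented): animal-ness, furniture-ness, action-ness
    "dog":   (0.9, 0.0, 0.1),
    "frog":  (0.8, 0.0, 0.1),
    "chair": (0.0, 0.9, 0.0),
    "sat":   (0.1, 0.1, 0.9),
}

def cosine(a, b):
    """Similarity of two vectors: 1.0 means used identically, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(toy_vectors["dog"], toy_vectors["frog"]))   # high: used alike
print(cosine(toy_vectors["dog"], toy_vectors["chair"]))  # low: used differently
```

The model only ever sees that "dog" and "frog" behave alike in text; it has no idea what either animal actually is, which is why a deliberately ridiculous prompt can still produce a confident answer.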

Ironically, Copilot's summary of the article doesn't even refer to the description of what a summary is! Maybe it's offended that I said it doesn't understand the text it's being asked to summarise...

It's odd to me that it placed that item as point five, given that the article said in its introduction that I was going to write this follow-up article.

And finally, while point six accurately summarises that Copilot can be helpful in creating the initial draft of a summary, I don't think it adequately conveys that if the point of the summary is to assist decision-making, rather than to provide passing interest for somebody who wouldn't have read the original anyway, then the Copilot first draft must be edited by a human to check that nothing important has been missed out.
