created: $=dv.current().file.ctime
modified: $=this.modified
tags: nlp, philosophy, ai

Types

Extraction-based summarization

  • Content is extracted from the original data, but the extracted content is not modified in any way. For text, this is analogous to skimming, where important pieces of text (headings, first and last sentences of paragraphs) are extracted before reading the entire article.

Abstractive summarization

  • Generate new text that did not exist in the original. These systems build an internal semantic representation of the original content (a language model), then use this representation to create a summary close to what a human might express. This is more difficult than extraction, requiring NLP and domain understanding. Paraphrasing is even harder to apply to images and videos, which is why most summarization systems are extractive.

Aided summarization

  • A human post-processes software output, much as one edits the output of automatic translation in Google Translate.

Tasks

Generic summarization

  • focuses on obtaining a generic summary or abstract of the collection.

Query-based summarization

  • focuses on content specific to a query.

At a very high level, summarization algorithms try to find a subset of objects (e.g., a set of sentences or a set of images) that covers the information of the entire set. Such a subset is called a core-set.
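The core-set idea above can be sketched with a simple greedy heuristic. This is a minimal illustration, not a specific published algorithm: the word-coverage objective and all names are illustrative assumptions.

```python
# Greedy sketch of the core-set idea: pick sentences that together
# cover as many distinct words of the document as possible.
# The word-coverage objective here is an illustrative simplification.

def greedy_coreset(sentences, k):
    """Select up to k sentences, greedily maximizing coverage of distinct words."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    for _ in range(k):
        # Pick the sentence that adds the most new words.
        best = max(remaining, key=lambda s: len(set(s.lower().split()) - covered))
        gain = set(best.lower().split()) - covered
        if not gain:
            break  # nothing new left to cover
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen

doc = [
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "The mat was red.",
]
summary = greedy_coreset(doc, 2)
```

Real extractive systems use richer objectives (e.g., similarity to the document, redundancy penalties), but the structure — iteratively adding the object with the highest marginal coverage gain — is the same.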

Keyphrase extraction

  • You are given a piece of text and must produce a list of keywords that capture the primary topics discussed in the text.
  • This is useful for improvements to information retrieval, short summarization, and generating indexes.

Supervised Learning

  • Given a document, construct an example for each unigram, bigram, and trigram in the text, then compute various features describing each example. The keyphrases of the training documents are assumed to be known and are used to assign positive or negative labels to the examples. Some classifiers make a binary classification, while others assign a probability.
  • Requires training data.
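The candidate-and-label setup above can be sketched as follows. The particular features (length, frequency, first position) are common illustrative choices, not a prescribed feature set; the helper names are hypothetical.

```python
# Sketch of the supervised setup: turn each unigram/bigram/trigram into a
# labeled feature vector. Feature choices and names are illustrative.

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_examples(text, known_keyphrases):
    tokens = text.lower().split()
    joined = " ".join(tokens)
    examples = []
    for n in (1, 2, 3):
        for cand in set(ngrams(tokens, n)):
            features = {
                "length": n,                      # words in the candidate
                "freq": joined.count(cand),       # rough frequency in the text
                "first_pos": joined.find(cand),   # earlier is often a good sign
            }
            label = cand in known_keyphrases      # supervision signal
            examples.append((cand, features, label))
    return examples

exs = build_examples("text rank scores text by graph centrality",
                     {"text rank", "graph centrality"})
positives = [cand for cand, feats, label in exs if label]
```

A classifier (logistic regression, naive Bayes, etc.) would then be trained on these (features, label) pairs and applied to candidates from unseen documents.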

TextRank

Inspired by PageRank, TextRank takes a graph-based approach to analyzing the relationships between words within a document. It assigns a score to each word based on these connections, highlighting the most informative terms that capture the document's essence.
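The idea can be sketched in a few lines: build a co-occurrence graph over words within a sliding window, then propagate scores PageRank-style. The window size, damping factor, and iteration count below are typical illustrative values, not prescribed ones.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=50):
    """TextRank sketch: co-occurrence graph + PageRank-style score propagation.
    Parameter values are illustrative defaults."""
    # Link words that co-occur within the sliding window.
    neighbors = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                neighbors[tokens[i]].add(tokens[j])
                neighbors[tokens[j]].add(tokens[i])
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new = {}
        for w in scores:
            # Each neighbor contributes its score, split among its own links.
            rank = sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            new[w] = (1 - damping) + damping * rank
        scores = new
    return sorted(scores, key=scores.get, reverse=True)

tokens = "graph based ranking brings order into text graph ranking".split()
top = textrank_keywords(tokens)
```

Words that co-occur with many other well-connected words end up ranked highest; in full systems, tokens are usually filtered by part of speech before building the graph.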

Co-occurrence

Co-occurrence is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus.

Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or of an idiomatic expression. Corpus linguistics and its statistical analyses reveal patterns of co-occurrence within a language and make it possible to identify typical collocations for its lexical items. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.
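"Above-chance" co-occurrence is commonly quantified with pointwise mutual information (PMI), which compares the observed frequency of a pair against what independence would predict. A minimal sketch for adjacent word pairs, assuming simple maximum-likelihood probability estimates:

```python
import math
from collections import Counter

def adjacent_pmi(tokens):
    """Sketch: pointwise mutual information for adjacent word pairs,
    a simple way to flag above-chance co-occurrence (collocations)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / (n - 1)                 # probability of the adjacent pair
        p_a, p_b = unigrams[a] / n, unigrams[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))  # > 0 means above chance
    return pmi

tokens = "new york is big and new york is busy and trees are new".split()
scores = adjacent_pmi(tokens)
```

On real corpora the same computation (with smoothing and frequency cutoffs) surfaces collocations like "New York" well above incidental pairs.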

Effective Summarization - Evaluation

The most common way to evaluate a summary is to compare it against human-made model summaries.

Evaluation can be intrinsic or extrinsic. Intrinsic evaluation assesses the summaries directly, while extrinsic evaluation measures how the summarization system affects the completion of some other task.

Intrinsic = mainly the coherence and informativeness of the summaries. Extrinsic = the impact of summarization on a task, such as reading comprehension.

Human judgement often varies greatly in what it considers a “good” summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but this is both time and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

is the most common metric. It attempts to measure how well a summary covers the content of human-generated summaries known as references.

ROUGE cannot determine whether the result is coherent, that is, whether the sentences flow together sensibly.

Like BLEU, ROUGE does not handle different words that have the same meaning: it measures syntactic matches rather than semantics.
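The core of ROUGE-N is just clipped n-gram overlap against the reference. A minimal sketch of ROUGE-1 recall (single reference, whitespace tokenization assumed) also makes the synonym blindness concrete:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Sketch of ROUGE-N recall: the fraction of reference n-grams that
    also appear in the candidate summary, with clipped counts."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    # Clip each n-gram's credit at its count in the candidate.
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())

score = rouge_n_recall("the cat sat on the mat", "the cat was on the mat")
# Synonyms get no credit: "large" vs. "big" scores zero.
zero = rouge_n_recall("large", "big")
```

Full ROUGE implementations add stemming, multiple references, precision/F-measure variants, and longest-common-subsequence scoring (ROUGE-L), but none of these recover semantic matches either.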