tags: Programming
rel: Parts of Language, Transcription Structure, Sentences

Containers

Primarily Doc, Span, and Token.

Doc is the main container object. One of its attributes is doc.sents, a generator that yields the document's sentences. A Span is a slice of the Doc and can cover multiple tokens.

import spacy

nlp = spacy.load("en_core_web_sm")
with open("data/wiki_us.txt", "r") as f:
    text = f.read()
doc = nlp(text)
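
As a quick illustration of these containers (a sketch, reusing the doc built above), you can pull out sentences, spans, and tokens like this:

# doc.sents is a generator, so wrap it in a list to index it
sentence1 = list(doc.sents)[0]
print(type(sentence1))  # <class 'spacy.tokens.span.Span'>

# a Span can also be sliced directly from the Doc across several tokens
span = doc[0:5]
print(span.text)

# individual items are Token objects
token = doc[0]
print(token.text, type(token))  # <class 'spacy.tokens.token.Token'>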

Sentence Boundary Disambiguation

Dividing raw text into sentences is a difficult problem for natural language processing tools because punctuation marks are ambiguous. A period may indicate the end of a sentence, but it has other uses as well: about 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks can also be ambiguous because of emoticons, source code, and slang.

Vanilla approach (about 95% accuracy)

  • If it is a period, it ends a sentence.
  • If the preceding token is in a hand-compiled list of abbreviations, then it does not end a sentence.
  • If the next token is capitalized, then it ends a sentence.

Idiosyncratic orthography such as .hack//SIGN, shortened names, and other edge cases account for the remaining 5%.
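
A minimal sketch of those rules over whitespace tokens (hand-rolled for illustration, not spaCy's actual segmenter; the abbreviation list is a stand-in, and here an abbreviation always blocks the boundary):

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "vs.", "etc."}  # hand-compiled stand-in list

def naive_sentences(tokens):
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith("."):
            continue
        if tok in ABBREVIATIONS:
            continue  # rule 2: abbreviation, so not a boundary
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None or nxt[0].isupper():
            sentences.append(" ".join(current))  # rules 1 and 3: boundary
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive_sentences("Dr. Smith went home. He slept.".split()))
# ['Dr. Smith went home.', 'He slept.']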

Token Attributes

.text .head .left_edge .right_edge .ent_type_ .ent_iob_ .lemma_ .morph .pos_ .dep_ .lang_
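
For example, inspecting a few of these attributes on a single token from the doc built earlier:

token = doc[0]
print(token.text, token.pos_, token.dep_, token.lemma_, token.morph)
print(token.head.text, token.left_edge.text, token.right_edge.text)
print(token.ent_type_, token.ent_iob_, token.lang_)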

Word Vectors

Word vectors, or word embeddings, are numerical representations of words in multidimensional space, stored as matrices. The purpose of a word vector is to get a computer system to understand a word. Computers cannot understand text directly; they can, however, process numbers quickly and well. For this reason, it is important to convert each word into a number.

Early methods for creating word vectors in a pipeline take all the words in a corpus and assign each unique word a single number. These are then stored in a dictionary that looks like this: {the: 1, a: 2}. This is known as a bag of words. This approach lets a computer identify unique words numerically, but it does not let the computer understand meaning.
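
A toy version of that word-to-number mapping (purely illustrative):

corpus = "the cat sat on a mat the cat".split()
word_to_id = {}
for word in corpus:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id) + 1
print(word_to_id)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'a': 5, 'mat': 6}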

Example:

  • Tom loves to eat chocolate.
  • Tom likes to eat chocolate.

As numerical arrays, these sentences might look like:

  • 1, 2, 3, 4, 5
  • 1, 6, 3, 4, 5

To a human, the two sentences are nearly identical. The only difference is the degree to which Tom appreciates eating chocolate. If we examine the numbers, however, the two sentences seem quite close, but their semantic relationship is impossible to know for certain. How similar is 2 to 6? The number 6 could represent "hates" just as easily as "likes".

Word vectors solve this by taking the one-dimensional bag of words and giving it multidimensional meaning: each word is represented as a point in higher-dimensional space. This is achieved through machine learning and can be done easily with Python libraries such as Gensim.
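
A minimal Gensim sketch (the toy corpus and hyperparameters here are illustrative stand-ins, not a real training setup):

from gensim.models import Word2Vec

# toy corpus; a real setup would use many thousands of tokenized sentences
corpus = [
    ["tom", "loves", "to", "eat", "chocolate"],
    ["tom", "likes", "to", "eat", "chocolate"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=100)
print(model.wv["loves"][:5])                  # first five dimensions of the vector
print(model.wv.similarity("loves", "likes"))  # cosine similarity between the two words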

The goal of word vectors is to achieve a numerical understanding of language so that a computer can perform more complex tasks on a corpus. In the example above, how do we get a computer to understand that 2 and 6 mean something similar? One option you might think of is to give the computer a synonym dictionary: it can look up synonyms and then know what words mean.

from PyDictionary import PyDictionary

dictionary = PyDictionary()
text = "Tom loves to eat chocolate"

words = text.split()
for word in words:
    syns = dictionary.synonym(word)  # may return None if no entry is found
    if syns:
        print(f"{word}: {syns[0:5]}\n")

Even with simple sentences the results are comically bad. The reason is that synonym substitution does not take into account the syntactic behavior of synonyms. I do not believe anyone would think "Felis domesticus", the Latin name of the common house cat, is an adequate substitution for the name Tom.

What do Word Vectors look like?

sentence1[0].vector
array([ 2.7204e-01, -6.2030e-02, -1.8840e-01,  2.3225e-02, -1.8158e-02,
         6.7192e-03, -1.3877e-01,  1.7708e-01,  1.7709e-01,  2.5882e+00,
        -3.5179e-01, -1.7312e-01,  4.3285e-01, -1.0708e-01,  1.5006e-01,
        ...])

Once trained, word vectors allow similarity matches to be made very quickly and reliably.

import numpy as np

# requires a pipeline with static vectors, e.g. en_core_web_md
your_word = "dog"
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
# words == ['dog', 'KENNEL', 'dogs', 'CANINES', 'GREYHOUND', 'pet', 'Pet-Care', 'FELINE', 'cat', 'BEAGLES']

Doc similarity is also possible

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers")
doc2 = nlp("Fast food tastes very good.")

print(doc1, "<->", doc2, doc1.similarity(doc2))
# 0.7799485853415737

Regex and spaCy

Regular expressions (regex) were invented by Stephen Cole Kleene in the 1950s.

text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
    print (hit)

<callable_iterator object at 0x00000217A4256670>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>
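
To connect these matches back to spaCy, one option (a sketch, reusing the nlp pipeline from earlier) is to map the character offsets onto a Doc with doc.char_span:

doc = nlp(text)
for match in re.finditer(pattern, text):
    start, end = match.span()
    span = doc.char_span(start, end)  # None if the offsets don't align with token boundaries
    if span is not None:
        print(span.text, span.start, span.end)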