tags: Programming
rel: Parts of Language, Transcription Structure, Sentences
Containers
Primarily doc, span, and token. doc is the main object container; one of its attributes is doc.sents, a generator over sentences. A span can cross multiple tokens.
import spacy

nlp = spacy.load("en_core_web_sm")
with open("data/wiki_us.txt", "r") as f:
    text = f.read()
doc = nlp(text)
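A minimal sketch of how the three containers relate, using the doc built above:

for sent in doc.sents:              # doc.sents yields Span objects
    print(sent.text)
    break

span = doc[0:3]                     # a Span slice over the first three tokens
token = doc[0]                      # a single Token
print(type(span).__name__, type(token).__name__)   # Span Token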
Sentence Boundary Disambiguation
The problem of getting natural language processing tools to divide their input into sentences is difficult due to the ambiguity of punctuation marks. A period may indicate the end of a sentence, but it also has other uses: about 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks can be ambiguous as well, due to emoticons, source code, and slang.
The vanilla approach (roughly 95% accurate; sketched in code below):
- If it is a period, it ends a sentence.
- If the preceding token is in a hand-compiled list of abbreviations, then it does not end a sentence.
- If the next token is capitalized, then it ends a sentence.
Idiosyncratic orthographic signs like .hack//SIGN, shortened names, and other edge cases account for the remaining 5%.
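A toy sketch of those three rules; the abbreviation list here is a hypothetical stand-in for a real hand-compiled one:

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g."}   # hypothetical stand-in list

def ends_sentence(prev_tok, tok, next_tok):
    if not tok.endswith("."):
        return False                 # rule 1 only fires on periods
    if prev_tok in ABBREVIATIONS or tok in ABBREVIATIONS:
        # rule 2: an abbreviation's period is not a boundary,
        # unless rule 3 applies and the next token is capitalized
        return next_tok is not None and next_tok[0].isupper()
    return True                      # rule 1: a plain period ends the sentence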
Token Attributes
.text .head .left_edge .right_edge .ent_type_ .ent_iob_ .lemma_ .morph .pos_ .dep_ .lang_
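A quick way to inspect these attributes on a single token from the doc built earlier:

token = doc[0]
print(token.text, token.head, token.left_edge, token.right_edge)
print(token.ent_type_, token.ent_iob_, token.lemma_, token.morph)
print(token.pos_, token.dep_, token.lang_)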
Word Vectors
Word vectors, or word embeddings, are numerical representations of words in multidimensional space, stored as matrices. The purpose of a word vector is to get a computer system to understand a word. Computers cannot understand text efficiently; they can, however, process numbers quickly and well. For this reason, it is important to convert a word into a number.
Initial methods for creating word vectors in a pipeline take all words in a corpus and convert each into a single, unique number. These are then stored in a dictionary that would look like this: {the: 1, a: 2}. This is known as a bag of words. This approach lets a computer identify unique words numerically, but it does not let the computer understand meaning.
Example:
- Tom loves to eat chocolate.
- Tom likes to eat chocolate.

These sentences as numerical arrays might look like:
- 1, 2, 3, 4, 5
- 1, 6, 3, 4, 5
As we can see, to a human both sentences are nearly identical. The only difference is the degree to which Tom appreciates eating chocolate. If we examine the numbers, however, these two sequences seem quite close, but their semantic meaning is impossible to know for certain. How similar is 2 to 6? The number 6 could represent hates just as much as it represents likes.
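A minimal sketch of that one-number-per-word encoding (purely illustrative):

sentences = ["Tom loves to eat chocolate", "Tom likes to eat chocolate"]
word_ids = {}
for sent in sentences:
    for word in sent.split():
        word_ids.setdefault(word, len(word_ids) + 1)   # first occurrence gets the next id

encoded = [[word_ids[w] for w in s.split()] for s in sentences]
print(encoded)   # [[1, 2, 3, 4, 5], [1, 6, 3, 4, 5]]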
Word vectors solve this by taking the one-dimensional bag of words and giving it multidimensional meaning by representing each word in a higher-dimensional space. This is achieved through machine learning and can be done easily with Python libraries such as Gensim.
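A minimal Gensim sketch, assuming a tiny toy corpus (real training needs far more text than this):

from gensim.models import Word2Vec

corpus = [["tom", "loves", "to", "eat", "chocolate"],
          ["tom", "likes", "to", "eat", "chocolate"]]
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["chocolate"][:5])               # first five dimensions of one vector
print(model.wv.similarity("loves", "likes"))   # cosine similarity of two words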
The goal of word vectors is to achieve a numerical understanding of language so that a computer can perform more complex tasks on a corpus. In the example above, how do we get a computer to understand that 2 and 6 mean something similar? One option you might be thinking of is to give the computer a synonym dictionary, so it can look up synonyms and know what words mean.
from PyDictionary import PyDictionary

dictionary = PyDictionary()
text = "Tom loves to eat chocolate"
words = text.split()
for word in words:
    syns = dictionary.synonym(word)   # may return None if no entry exists
    print(f"{word}: {syns[0:5] if syns else syns}\n")
Even with simple sentences the results are comically bad. The reason is that synonym substitution does not take into account the syntactic differences between synonyms. I do not believe anyone would think "Felis domesticus", the Latin name of the common house cat, is an adequate substitution for the name Tom.
What do Word Vectors look like?
sentence1 = list(doc.sents)[0]   # first sentence of the doc; real vectors need a model like en_core_web_md
sentence1[0].vector
array([ 2.7204e-01, -6.2030e-02, -1.8840e-01, 2.3225e-02, -1.8158e-02, 6.7192e-03, -1.3877e-01, 1.7708e-01, 1.7709e-01, 2.5882e+00,
       -3.5179e-01, -1.7312e-01, 4.3285e-01, -1.0708e-01, 1.5006e-01,
       ...])
Once trained, word vectors allow similarity matching very quickly and reliably.
import numpy as np

your_word = "dog"
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
# ['dog', 'KENNEL', 'dogs', 'CANINES', 'GREYHOUND', 'pet', 'Pet-Care', 'FELINE', 'cat', 'BEAGLES']
Doc similarity is also possible
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers")
doc2 = nlp("Fast food tastes very good.")
print(doc1, "<->", doc2, doc1.similarity(doc2))
# 0.7799485853415737
Regex and spaCy
Regex was invented by Stephen Cole Kleene in the 1950s.
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
print (hit)
<callable_iterator object at 0x00000217A4256670>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>