Using Stanford CoreNLP to Get the Part-of-Speech (POS) Tags of Arabic Text, Python Example

Install :

$ pip install stanfordcorenlp

Then download stanford-corenlp-4.1.0 and save it in your project’s folder.
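Note: for Arabic tagging you will also need the Arabic models jar (stanford-corenlp-4.1.0-models-arabic.jar, available on the same download page) placed inside that folder so the wrapper can load the Arabic pipeline.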

To download more Arabic datasets, go to the Leipzig Corpora Collection website:

https://wortschatz.uni-leipzig.de/en/download/arabic

The dataset used in the current example comes from this collection.

We find the POS tag for each word in the text and serialize the result in this format:

word<space>tag<tab>word2<space>tag2 …

from stanfordcorenlp import StanfordCoreNLP

def find_pos(xsent):
    keepmyfinal = ''
    # start a local CoreNLP instance from the project folder, with the Arabic models
    with StanfordCoreNLP(r'stanford-corenlp-4.1.0', lang='ar') as nlp:
        Keepres = nlp.pos_tag(xsent)  # list of (word, tag) pairs
        for k in Keepres:
            # convert_ara_to_bw() transliterates Arabic script to Buckwalter (defined below)
            keepmyfinal += "{} {}\t".format(convert_ara_to_bw(k[0]), k[1])
    return keepmyfinal
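The helpers convert_ara_to_bw() and convert_bw_to_ara() transliterate between Arabic script and the Buckwalter scheme you will see in the output below. The post does not show them, so here is a minimal sketch (a hypothetical implementation covering the basic letters, diacritics omitted):

BW_MAP = {
    'ء': "'", 'آ': '|', 'أ': '>', 'ؤ': '&', 'إ': '<', 'ئ': '}',
    'ا': 'A', 'ب': 'b', 'ة': 'p', 'ت': 't', 'ث': 'v', 'ج': 'j',
    'ح': 'H', 'خ': 'x', 'د': 'd', 'ذ': '*', 'ر': 'r', 'ز': 'z',
    'س': 's', 'ش': '$', 'ص': 'S', 'ض': 'D', 'ط': 'T', 'ظ': 'Z',
    'ع': 'E', 'غ': 'g', 'ف': 'f', 'ق': 'q', 'ك': 'k', 'ل': 'l',
    'م': 'm', 'ن': 'n', 'ه': 'h', 'و': 'w', 'ى': 'Y', 'ي': 'y',
}
ARA_MAP = {bw: ara for ara, bw in BW_MAP.items()}

def convert_ara_to_bw(word):
    # unmapped characters (digits, Latin letters) pass through unchanged
    return ''.join(BW_MAP.get(ch, ch) for ch in word)

def convert_bw_to_ara(word):
    return ''.join(ARA_MAP.get(ch, ch) for ch in word)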

Let’s tag a sample verse (Qur’an 2:12, “indeed, it is they who are the corrupters, but they do not perceive it”):

find_pos('ألا إنهم هم المفسدون ولكن لا يشعرون').rstrip()
Result:
>lA IN
<n IN
hm PRP
hm PRP
Almfsdwn DTNNS
w CC
lkn CC
lA RP
y$Erwn VBP

Read the file and find words that share the same tag:

Read the text file:

def loadUnqList(p):
    klist = []
    with open(p) as fword:
        klist = fword.read().splitlines()
    return klist



loadquran = loadUnqList('sample_msa_fixed.fo')
print(len(loadquran))


# Result
50

We can search for tags like NNP (proper noun):

search_Tag = 'NNP'
numres = 200

keepres = []
for line in loadquran:
    # each line holds word-tag pairs separated by tabs
    for pair in line.split('\t'):
        xi = pair.split(' ')
        # skip empty trailing fields, keep words carrying the wanted tag
        if len(xi) > 1 and xi[1] == search_Tag:
            # convert_bw_to_ara() transliterates Buckwalter back to Arabic script
            keepres.append(convert_bw_to_ara(xi[0]))


# Count the frequency of each word

import collections
import pandas as pd

counts_nsw = collections.Counter(keepres)
clean_tweets_nsw = pd.DataFrame(counts_nsw.most_common(numres), columns=['words', 'count'])
similar_words = [i[0] for i in counts_nsw.most_common(numres)]
     

word_frequency = {}


# prepare the frequency dict with display-ready Arabic keys, then plot

import arabic_reshaper
from bidi.algorithm import get_display

for word_tuple in counts_nsw.most_common(numres):
    # reshape + reorder so Arabic renders correctly in matplotlib
    reshaped_word = arabic_reshaper.reshape(word_tuple[0])
    key = get_display(reshaped_word)
    word_frequency[key] = word_tuple[1]
    
   

from typing import Dict, List
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_word_cloud(word_list: List[str], word_frequency: Dict[str, float]):
    full_string = ' '.join(word_list)
    reshaped_text = arabic_reshaper.reshape(full_string)
    translated_text = get_display(reshaped_text)
    # Build the Arabic word cloud; font_path must point to a font that has
    # Arabic glyphs (e.g. a Tahoma .ttf on your system)
    wordc = WordCloud(font_path='tahoma', background_color='white',
                      width=800, height=300).generate(translated_text)
    # re-weight the cloud with the exact frequency counts
    wordc.fit_words(word_frequency)
    plt.imshow(wordc)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title('Search in Quran Tags, By Faisal Alshargi')
    plt.show()

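Call the function with the word list and frequency dict built above:

plot_word_cloud(similar_words, word_frequency)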
Result (row index, word, count):

0 بنت 3
1 عبدالله 3
2 بن 3
3 عبدالعزيز 3
4 آل 3
.. … …
56 جدة 1
57 أبو 1
58 أمريكا 1
59 أيار 1
60 سوريا 1

Search for past-tense verbs in the text:

search_Tag = 'VBD'
numres = 200

Search for present-tense verbs in the text:

search_Tag = 'VBP'
numres = 200
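Since only search_Tag changes between these searches, the tag-search loop above can be wrapped in a small helper (a hypothetical refactor, not in the original post):

def find_words_by_tag(lines, tag):
    # collect every word in the word<space>tag pairs that carries the given tag
    found = []
    for line in lines:
        for pair in line.split('\t'):
            parts = pair.split(' ')
            if len(parts) > 1 and parts[1] == tag:
                found.append(convert_bw_to_ara(parts[0]))
    return found

past_verbs = find_words_by_tag(loadquran, 'VBD')     # VBD = past tense
present_verbs = find_words_by_tag(loadquran, 'VBP')  # VBP = present tense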

Build Your Own Knowledge Graph from Text Using Trump Tweets, Python Example

import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher 
from spacy.tokens import Span 
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm
pd.set_option('display.max_colwidth', 200)
%matplotlib inline

Import Trump’s tweets:

list_tweets = pd.read_csv("trump.csv")
list_tweets.shape


Result:
(678, 1)

Let’s check a few sample tweets:

list_tweets['Tweets'].sample(2)

Result:
222    wow just starting to hear the democrats who are only thinking obstruct and delay are starting to put out the word that the time and scope of fbi looking into judge kavanaugh and witnesses is not e...
127    yesterday was a bad day for the cuomo brothers new york was lost to the looters thugs radical left and all others forms of lowlife amp scum the governor refuses to accept my offer of a dominating ...
Name: Tweets, dtype: object

Check the subject and object of a sample sentence by printing each token’s dependency tag:

doc = nlp("US election 2020: We put Republicans and Democrats in a group chat.")
    
for tok in doc:
  print(tok.text, "  >     ", tok.dep_)

Result:
US            >      compound
election      >      ROOT
2020          >      nummod
:             >      punct
We            >      nsubj
put           >      ROOT
Republicans   >      dobj
and           >      cc
Democrats     >      conj
in            >      prep
a             >      det
group         >      compound
chat          >      pobj
.             >      punct

A function to extract the subject and the object (the entities) from a sentence:

def getentities_fromtweet(sent):
  ## part 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence
  prefix = ""
  modifier = ""  
  for tok in nlp(sent):
    ## part 2
    if tok.dep_ != "punct":
      # check: token is a compound word or not
      if tok.dep_ == "compound":
        prefix = tok.text
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " "+ tok.text
      
      # check: token is a modifier or not
      if tok.dep_.endswith("mod"):
        modifier = tok.text
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " "+ tok.text
      
      ## part 3: the token is the subject
      # tags like "nsubj"/"nsubjpass" contain "subj" starting at index 1
      if tok.dep_.find("subj") == 1:
        ent1 = modifier +" "+ prefix + " "+ tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""      

      ## part 4: the token is the object
      # tags like "dobj"/"pobj" contain "obj" starting at index 1
      if tok.dep_.find("obj") == 1:
        ent2 = modifier +" "+ prefix +" "+ tok.text
        
      ## part 5
      # remember this token for the next iteration
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text

  return [ent1.strip(), ent2.strip()]

Let’s test it. In the sentence below, ‘car’ is the subject and ‘200 colors’ is the object (the double space in the output is left behind by the empty prefix slot).

getentities_fromtweet("the car has 200 colors")

Result:
['car', '200  colors']

Now extract the subject-object (entity) pairs from all the tweets:

pairs_entity = []
for i in tqdm(list_tweets["Tweets"]):
  pairs_entity.append(getentities_fromtweet(i))

Result:
100%|██████████| 678/678 [00:08<00:00, 78.01it/s]

Here are a few of the extracted subject-object pairs:

pairs_entity[11:20]

Result:
[['why  i', 'presidential bid trumpvlog'],
 ['higher  self', 'direct donald j trump'],
 ['that', 'federal  cont'],
 ['china', 'anywhere  world'],
 ['they', 'away  dems'],
 ['success', 'challenges setbacks'],
 ['always  you', 'one'],
 ['big  things', 'businesses'],
 ['here  that', 'business prospects']]

Use spaCy rule-based matching to extract the relation (predicate):

def get_relation(sent):
  doc = nlp(sent)
  # Matcher class object 
  matcher = Matcher(nlp.vocab)
  # define the pattern: the sentence ROOT, optionally followed by a
  # preposition, an agent, and an adjective
  pattern = [{'DEP':'ROOT'}, 
            {'DEP':'prep','OP':"?"},
            {'DEP':'agent','OP':"?"},  
            {'POS':'ADJ','OP':"?"}] 

  # spaCy 2.x signature; spaCy 3.x uses matcher.add("matching_1", [pattern])
  matcher.add("matching_1", None, pattern) 
  matches = matcher(doc)
  # keep the last match found
  k = len(matches) - 1
  span = doc[matches[k][1]:matches[k][2]] 
  return span.text
# test the function
get_relation("Faisal completed the task")

Result:
completed

Get the relations from the whole dataset:

relations = [get_relation(i) for i in tqdm(list_tweets['Tweets'])]

Result:
100%|██████████| 678/678 [00:08<00:00, 78.54it/s]

The most frequent relations (predicates) we have just extracted:

pd.Series(relations).value_counts()[:20]

Result:
is         45
have       26
be         19
thank      18
wow        16
was        14
with        9
see         8
want        8
s           7
let         7
think       7
get         6
said        6
working     6
are         6
yes         5
know        5
has         5
need        5
dtype: int64

Build a Knowledge Graph


# extract subject
source = [i[0] for i in pairs_entity]

# extract object
target = [i[1] for i in pairs_entity ]

kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
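To sanity-check the triples before plotting, you can peek at the first few rows of the dataframe (output omitted here):

kg_df.head()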

Create a directed graph from the dataframe:

G = nx.from_pandas_edgelist(kg_df, "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

Graph all the relations:


plt.figure(figsize=(12,12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

Let’s graph the relation “have”:


G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="have"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

Another graph, for the relation “thank”:

G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="thank"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

One more graph, for the relation “is”:


G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="is"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
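Since the three plots above differ only in the edge value, the filtering and drawing can be wrapped in a small helper (a hypothetical refactor of the blocks above):

def plot_relation(edge_label):
    # build a subgraph containing only the edges that carry this relation
    sub = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == edge_label],
                                  "source", "target",
                                  edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12, 12))
    pos = nx.spring_layout(sub, k=0.5)
    nx.draw(sub, with_labels=True, node_color='skyblue',
            node_size=1500, edge_cmap=plt.cm.Blues, pos=pos)
    plt.show()

plot_relation("is")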

Scatter Plot with Arabic Text-Labelled Data Points


import matplotlib.pyplot as plt
from bidi.algorithm import get_display
import arabic_reshaper
import numpy as np



from sklearn.decomposition import PCA

# plot function: project the word vectors to 2-D with PCA and label each point
def plotnow(words):
    word_vectors = np.vstack([model[w] for w in words])
    twodim = PCA().fit_transform(word_vectors)[:, :2]
    fig, ax = plt.subplots(1, figsize=(10, 6))
    fig.suptitle('Arabic Example Of Labelled Scatterpoints')
    # Plot the scatter points
    ax.scatter(twodim[:, 0], twodim[:, 1],
               color="red",    # Color of the dots
               s=100,          # Size of the dots
               alpha=0.5,      # Alpha of the dots
               linewidths=1)   # Size of edge around the dots

    for word, (x_pos, y_pos) in zip(words, twodim):
        xword = arabic_reshaper.reshape(word)  # support Arabic letters
        artext = get_display(xword)
        ax.annotate(artext,                   # The label for the point
                    xy=(x_pos, y_pos),        # Position of the corresponding point
                    xytext=(7, 0),            # Offset the text 7 points to the right
                    textcoords='offset points',
                    ha='left',                # Horizontally aligned to the left
                    va='center')              # Vertically centered
    # Show the plot once, after all labels are placed
    plt.show()

from gensim.models import word2vec

# four Arabic sentences from a news story about a rapid coronavirus breath test
sentences = [
    'تفتح التجارب الأولية لفحص جديد لفيروس كورونا الطريق أمام إمكانية تشخيص الفيروس خلال ثوان بدلا من ساعات.',
    'وأظهرت الأبحاث أن الفحص عن طريق التنفس، وهو أسلوب طور في مقاطعة ويلز، قد يكون قادرا عن التمييز بين فيروس كورونا وأي عدوى صدرية أخرى في الحال تقريبا.',
    'وجاء نشر البحث في دورية ذا لانسيت بعد إجراء تجارب في ألمانيا واسكتلندا.',
    'وقال المطورون (إيمسبيكس دياغنوستيكس) إن أجهزة الفحص قد تكون جاهزة للاستعمال خلال ستة شهور، إذا حصلوا على التمويل اللازم.'
]

for i, sentence in enumerate(sentences):
	tokenized= []
	for word in sentence.split(' '):
		word = word.split('.')[0]
		word = word.lower()
		tokenized.append(word)
	sentences[i] = tokenized

# gensim 3.x API; in gensim 4+ "size" was renamed "vector_size"
model = word2vec.Word2Vec(sentences, workers = 1, size = 200, min_count = 1, window = 2, sg = 0)

listofwords = list(model.wv.vocab.keys())
plotnow(listofwords)
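Once the model is trained, you can also query it for the nearest neighbours of a word (a quick sketch; with such a tiny corpus the neighbours will not be very meaningful):

model.wv.most_similar('كورونا', topn=5)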


To get similar words in Classical Arabic, Modern Standard Arabic, and the Arabic dialects, you can use this link:

{ Add a Comment }