
Using Stanford CoreNLP to get the part of speech (POS) of Arabic text, Python example

Install:

$ pip install stanfordcorenlp

Then download stanford-corenlp-4.1.0 and extract it into your project's folder.

To download more Arabic datasets, go to the Leipzig Corpora Collection website:

https://wortschatz.uni-leipzig.de/en/download/arabic

For the current example dataset, we find the POS tag for each word in the text and store the result in this format:

word <space> tag <tab> word2 <space> tag2 …

from stanfordcorenlp import StanfordCoreNLP

def find_pos(xsent):
    keepmyfinal = ''
    # point StanfordCoreNLP at the extracted folder and load the Arabic models
    with StanfordCoreNLP(r'stanford-corenlp-4.1.0', lang='ar') as nlp:
        keepres = nlp.pos_tag(xsent)
        for k in keepres:
            # transliterate the word to Buckwalter and append "word tag<tab>"
            keepmyfinal += "{} {}\t".format(convert_ara_to_bw(k[0]), k[1])
    return keepmyfinal
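The helper convert_ara_to_bw (and its inverse convert_bw_to_ara, used further down) is not defined in this post. Here is a minimal sketch of both, assuming a plain Buckwalter transliteration table that covers the base letters only (the post's real helpers may also handle diacritics and punctuation):

# Minimal Buckwalter transliteration helpers (base letters only; an assumption)
ara2bw = {
    'ا': 'A', 'ب': 'b', 'ت': 't', 'ث': 'v', 'ج': 'j', 'ح': 'H', 'خ': 'x',
    'د': 'd', 'ذ': '*', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': '$', 'ص': 'S',
    'ض': 'D', 'ط': 'T', 'ظ': 'Z', 'ع': 'E', 'غ': 'g', 'ف': 'f', 'ق': 'q',
    'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n', 'ه': 'h', 'و': 'w', 'ي': 'y',
    'ء': "'", 'آ': '|', 'أ': '>', 'ؤ': '&', 'إ': '<', 'ئ': '}', 'ة': 'p', 'ى': 'Y',
}
bw2ara = {v: k for k, v in ara2bw.items()}

def convert_ara_to_bw(word):
    # characters without a mapping (digits, Latin letters, ...) pass through unchanged
    return ''.join(ara2bw.get(c, c) for c in word)

def convert_bw_to_ara(word):
    return ''.join(bw2ara.get(c, c) for c in word)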

Let's get some results:

find_pos('ألا إنهم هم المفسدون ولكن لا يشعرون').rstrip()
Result:
>lA IN
<n IN
hm PRP
hm PRP
Almfsdwn DTNNS
w CC
lkn CC
lA RP
y$Erwn VBP

Read the file and find words that share the same tag. In the output above, the tags follow the Arabic-augmented Penn Treebank set: IN is a preposition or subordinating conjunction, PRP a personal pronoun, DTNNS a plural noun with the determiner Al-, CC a coordinating conjunction, RP a particle, and VBP a present-tense verb.

Read the text file:

def loadUnqList(p):
    klist = []
    with open(p) as fword:
        klist = fword.read().splitlines()
    return klist
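The tagged file used below ('sample_msa_fixed.fo') is assumed to have been produced beforehand. A rough sketch of how such a file could be written with find_pos (the raw input file name is hypothetical, and opening CoreNLP once per sentence is slow but keeps the sketch simple):

# write one "word tag<tab>word tag<tab>..." line per input sentence
with open('sample_msa_raw.txt', encoding='utf-8') as fin, \
        open('sample_msa_fixed.fo', 'w', encoding='utf-8') as fout:
    for sentence in fin.read().splitlines():
        fout.write(find_pos(sentence).rstrip() + '\n')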



loadquran = loadUnqList('sample_msa_fixed.fo')
print(len(loadquran))


# Result
50

We can search for tags, for example NNP (proper noun):

search_Tag = 'NNP'
numres = 200

keepres = []
for line in loadquran:
    for pair in line.split('\t'):
        xi = pair.split(' ')
        # skip empty entries left by the trailing tab
        if len(xi) == 2 and xi[1] == search_Tag:
            keepres.append(convert_bw_to_ara(xi[0]))


# Count the frequency of each word
import collections
import pandas as pd

counts_nsw = collections.Counter(keepres)
freq_df = pd.DataFrame(counts_nsw.most_common(numres), columns=['words', 'count'])
similar_words = [i[0] for i in counts_nsw.most_common(numres)]
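You can peek at the most frequent words before plotting:

# inspect the ten most frequent words carrying the searched tag
print(freq_df.head(10))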
     

word_frequency = {}

# reshape the Arabic words for correct right-to-left display
import arabic_reshaper
from bidi.algorithm import get_display

for word_tuple in counts_nsw.most_common(numres):
    reshaped_word = arabic_reshaper.reshape(word_tuple[0])
    key = get_display(reshaped_word)
    word_frequency[key] = word_tuple[1]
    
   

# plot the result as a word cloud
from typing import List, Dict
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(word_list: List[str], word_frequency: Dict[str, float]):
    full_string = ' '.join(word_list)
    reshaped_text = arabic_reshaper.reshape(full_string)
    translated_text = get_display(reshaped_text)
    # Build the Arabic word cloud, then weight the words by their frequency
    wordc = WordCloud(font_path='tahoma', background_color='white',
                      width=800, height=300).generate(translated_text)
    wordc.fit_words(word_frequency)
    plt.imshow(wordc)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title('Search in Quran Tags, By Faisal Alshargi')
    plt.show()

plot_word_cloud(similar_words, word_frequency)

Result:

0 بنت 3
1 عبدالله 3
2 بن 3
3 عبدالعزيز 3
4 آل 3
.. … …
56 جدة 1
57 أبو 1
58 أمريكا 1
59 أيار 1
60 سوريا 1

Search for past-tense verbs (VBD) in the text by re-running the same steps with:

search_Tag = 'VBD'
numres = 200

Search for present-tense verbs (VBP) in the text with:

search_Tag = 'VBP'
numres = 200


Build your own knowledge graph from text using Trump tweets, Python example

import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm

nlp = spacy.load('en_core_web_sm')
pd.set_option('display.max_colwidth', 200)
%matplotlib inline

Import Trump's tweets:

list_tweets = pd.read_csv("trump.csv")
list_tweets.shape


Result:
(678, 1)

Let's check a few sample tweets:

list_tweets['Tweets'].sample(2)

Result:
222    wow just starting to hear the democrats who are only thinking obstruct and delay are starting to put out the word that the time and scope of fbi looking into judge kavanaugh and witnesses is not e...
127    yesterday was a bad day for the cuomo brothers new york was lost to the looters thugs radical left and all others forms of lowlife amp scum the governor refuses to accept my offer of a dominating ...
Name: Tweets, dtype: object

Check the subject and object of a sample sentence:

doc = nlp("US election 2020: We put Republicans and Democrats in a group chat.")
    
for tok in doc:
  print(tok.text, "  >     ", tok.dep_)

Result:
US            >      compound
election      >      ROOT
2020          >      nummod
:             >      punct
We            >      nsubj
put           >      ROOT
Republicans   >      dobj
and           >      cc
Democrats     >      conj
in            >      prep
a             >      det
group         >      compound
chat          >      pobj
.             >      punct
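The displacy module imported earlier can render this parse graphically; in a Jupyter notebook (which the %matplotlib inline line suggests), this one-liner draws the dependency tree:

# draw the dependency tree inline in the notebook
displacy.render(doc, style='dep', jupyter=True)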

A function to extract the subject and the object (the entities) from a sentence:

def getentities_fromtweet(sent):
  ## part 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of the previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence
  prefix = ""
  modifier = ""
  for tok in nlp(sent):
    ## part 2
    if tok.dep_ != "punct":
      # check: is the token a compound word?
      if tok.dep_ == "compound":
        prefix = tok.text
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " " + tok.text

      # check: is the token a modifier?
      if tok.dep_.endswith("mod"):
        modifier = tok.text
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " " + tok.text

      ## part 3: the token is a subject (nsubj, csubj, ...)
      if "subj" in tok.dep_:
        ent1 = modifier + " " + prefix + " " + tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""

      ## part 4: the token is an object (dobj, pobj, ...)
      if "obj" in tok.dep_:
        ent2 = modifier + " " + prefix + " " + tok.text

      ## part 5: update variables
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text

  return [ent1.strip(), ent2.strip()]

It seems to work as planned: in the sentence below, 'car' is the subject and '200 colors' is the object.

getentities_fromtweet("the car has 200 colors")

Result:
['car', '200  colors']

Now apply the function to extract the subject and the object (entities) from all the tweets:

pairs_entity = []
for i in tqdm(list_tweets["Tweets"]):
  pairs_entity.append(getentities_fromtweet(i))

Result:
100%|██████████| 678/678 [00:08<00:00, 78.01it/s]

Here are a few of the subject-object pairs extracted from the tweets:

pairs_entity[11:20]

Result:
[['why  i', 'presidential bid trumpvlog'],
 ['higher  self', 'direct donald j trump'],
 ['that', 'federal  cont'],
 ['china', 'anywhere  world'],
 ['they', 'away  dems'],
 ['success', 'challenges setbacks'],
 ['always  you', 'one'],
 ['big  things', 'businesses'],
 ['here  that', 'business prospects']]

Use spaCy's rule-based matching to extract the relation (the predicate): the pattern matches the sentence's ROOT token, optionally followed by a preposition, an agent, or an adjective.

def get_relation(sent):
  doc = nlp(sent)
  # Matcher class object
  matcher = Matcher(nlp.vocab)
  # define the pattern: ROOT, then optional prep / agent / adjective
  pattern = [{'DEP': 'ROOT'},
             {'DEP': 'prep', 'OP': "?"},
             {'DEP': 'agent', 'OP': "?"},
             {'POS': 'ADJ', 'OP': "?"}]

  matcher.add("matching_1", None, pattern)  # in spaCy 3.x: matcher.add("matching_1", [pattern])
  matches = matcher(doc)
  k = len(matches) - 1
  # keep the longest (last) match
  span = doc[matches[k][1]:matches[k][2]]
  return span.text

# test the function
get_relation("Faisal completed the task")

Result:
completed

Get the relations from the whole dataset:

relations = [get_relation(i) for i in tqdm(list_tweets['Tweets'])]

Result:
100%|██████████| 678/678 [00:08<00:00, 78.54it/s]

These are the most frequent relations (predicates) we have just extracted:

pd.Series(relations).value_counts()[:20]

Result:
is         45
have       26
be         19
thank      18
wow        16
was        14
with        9
see         8
want        8
s           7
let         7
think       7
get         6
said        6
working     6
are         6
yes         5
know        5
has         5
need        5
dtype: int64

Build a Knowledge Graph


# extract subject
source = [i[0] for i in pairs_entity]

# extract object
target = [i[1] for i in pairs_entity ]

kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})

Create a directed graph from the dataframe:

G = nx.from_pandas_edgelist(kg_df, "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())
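As a quick optional sanity check, print the size of the resulting graph before plotting:

# how many entities (nodes) and relations (edges) did we extract?
print(G.number_of_nodes(), G.number_of_edges())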

Graph all the relations


plt.figure(figsize=(12,12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

Let's graph the relation "have":


G = nx.from_pandas_edgelist(kg_df[kg_df['edge']=="have"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

Another graph, for the relation "thank":

G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="thank"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()

One more graph, for the relation "is":


G = nx.from_pandas_edgelist(kg_df[kg_df['edge']=="is"], "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) 
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()


Scatter Plot with Arabic Text-Labelled Data Points


import matplotlib.pyplot as plt
from bidi.algorithm import get_display
import arabic_reshaper
import numpy as np



from sklearn.decomposition import PCA

# plot function
def plotnow(words):
    # stack the word vectors and reduce them to two dimensions with PCA
    word_vectors = np.vstack([model.wv[w] for w in words])
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    fig, ax = plt.subplots(1, figsize=(10, 6))
    fig.suptitle('Arabic Example Of Labelled Scatterpoints')
    # Plot the scatter points
    ax.scatter(twodim[:, 0], twodim[:, 1],
               color="red",    # color of the dots
               s=100,          # size of the dots
               alpha=0.5,      # alpha of the dots
               linewidths=1)   # size of the edge around the dots

    for word, (x_pos, y_pos) in zip(words, twodim):
        xword = arabic_reshaper.reshape(word)  # support Arabic letters
        artext = get_display(xword)
        ax.annotate(artext,                # the label for the point
                    xy=(x_pos, y_pos),     # position of the corresponding point
                    xytext=(7, 0),         # offset the text 7 points to the right
                    textcoords='offset points',
                    ha='left',             # horizontally aligned to the left
                    va='center')           # vertically centered
    # Show the plot once every point is labelled
    plt.show()

from gensim.models import word2vec

sentences = [
    'تفتح التجارب الأولية لفحص جديد لفيروس كورونا الطريق أمام إمكانية تشخيص الفيروس خلال ثوان بدلا من ساعات.',
    'وأظهرت الأبحاث أن الفحص عن طريق التنفس، وهو أسلوب طور في مقاطعة ويلز، قد يكون قادرا عن التمييز بين فيروس كورونا وأي عدوى صدرية أخرى في الحال تقريبا.',
    'وجاء نشر البحث في دورية ذا لانسيت بعد إجراء تجارب في ألمانيا واسكتلندا.',
    'وقال المطورون (إيمسبيكس دياغنوستيكس) إن أجهزة الفحص قد تكون جاهزة للاستعمال خلال ستة شهور، إذا حصلوا على التمويل اللازم.'
]

# simple whitespace tokenization; strip trailing periods and lowercase
for i, sentence in enumerate(sentences):
    tokenized = []
    for word in sentence.split(' '):
        word = word.split('.')[0]
        word = word.lower()
        tokenized.append(word)
    sentences[i] = tokenized

# train a small CBOW model (gensim 3.x API: size rather than vector_size)
model = word2vec.Word2Vec(sentences, workers=1, size=200, min_count=1, window=2, sg=0)

listofwords = list(model.wv.vocab.keys())
plotnow(listofwords)
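With only four sentences the vectors are essentially noise, but the query API works the same as on a real model (gensim 3.x shown); for example:

# query the toy model for the three nearest neighbours of a word
print(model.wv.most_similar('كورونا', topn=3))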


To get similar words in Classical Arabic, Modern Standard Arabic, and the Arabic dialects, you can use this link:


Find the emotions in text for any Twitter account: Trump, Biden, and Obama case study

Using this code, you can calculate the emotions expressed in the text.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import collections
import twitter
import nltk
from nltk.corpus import stopwords
import warnings
import re
import text2emotion as te

warnings.filterwarnings("ignore")
sns.set(font_scale=1.5)
sns.set_style("whitegrid")

consumer_key = 'Enter your consumer_key'
consumer_secret = 'Enter your consumer_secret'
access_token = 'Enter your access_token'
access_token_secret = 'Enter your access_token_secret'

# authenticate once with the python-twitter library
api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token,
                  access_token_secret=access_token_secret)


# write the Twitter account name
Account_name = "@BarackObama"

keepres = api.GetUserTimeline(screen_name=Account_name, count=200)

# create lists to save the emotion scores for each tweet
Happylst = []
Angrylst = []
Surpriselst = []
Sadlst = []
Fearlst = []

# counters for the dominant emotion across all tweets
keep_Happyx = 0
keep_Angryx = 0
keep_Surprisex = 0
keep_Sadx = 0
keep_Fearx = 0


# collect the text of each tweet
all_tweets = []
tweets = [i.AsDict() for i in keepres]
for t in tweets:
    all_tweets.append(t['text'])

 
# remove links, clean the tweets
def remove_url(txt):
    return " ".join(re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())
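For example, on a made-up tweet the cleaner strips the URL, the hashtag sign, and the punctuation:

# hypothetical input, just to show what the cleaner does
print(remove_url("Great rally tonight! https://t.co/abc123 #MAGA"))
# -> Great rally tonight MAGA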


all_tweets_no_urls = [remove_url(tweet) for tweet in all_tweets]

# find the maximum feeling in each tweet
 
def find_maxemotion(Happyx, Angryx, Surprisex, Sadx, Fearx):
    # return the name of the strongest emotion; on a tie it returns ''
    res_max = ''

    if Happyx > Angryx and Happyx > Surprisex and Happyx > Sadx and Happyx > Fearx:
        res_max = 'Happy'

    if Angryx > Happyx and Angryx > Surprisex and Angryx > Sadx and Angryx > Fearx:
        res_max = 'Angry'

    if Surprisex > Happyx and Surprisex > Angryx and Surprisex > Sadx and Surprisex > Fearx:
        res_max = 'Surprise'

    if Sadx > Happyx and Sadx > Angryx and Sadx > Surprisex and Sadx > Fearx:
        res_max = 'Sad'

    if Fearx > Happyx and Fearx > Angryx and Fearx > Surprisex and Fearx > Sadx:
        res_max = 'Fear'

    return res_max
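An equivalent, more compact version using a dict and max(); note one behavioural difference: on an exact tie it returns one of the tied emotions instead of the empty string the original returns:

def find_maxemotion_compact(Happyx, Angryx, Surprisex, Sadx, Fearx):
    scores = {'Happy': Happyx, 'Angry': Angryx, 'Surprise': Surprisex,
              'Sad': Sadx, 'Fear': Fearx}
    # max() returns the key with the highest score
    return max(scores, key=scores.get)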



for t in all_tweets_no_urls:
    keepemo = te.get_emotion(t)
    emres = find_maxemotion(keepemo['Happy'], keepemo['Angry'], keepemo['Surprise'], keepemo['Sad'], keepemo['Fear'])
    # record the raw scores for plotting later
    Happylst.append(keepemo['Happy'])
    Angrylst.append(keepemo['Angry'])
    Surpriselst.append(keepemo['Surprise'])
    Sadlst.append(keepemo['Sad'])
    Fearlst.append(keepemo['Fear'])
    # count the dominant emotion
    if emres == 'Happy':
        keep_Happyx += 1
    if emres == 'Angry':
        keep_Angryx += 1
    if emres == 'Surprise':
        keep_Surprisex += 1
    if emres == 'Sad':
        keep_Sadx += 1
    if emres == 'Fear':
        keep_Fearx += 1

print('Finish reading')
# print number of each emotion
print('Happy', keep_Happyx)
print('Angry', keep_Angryx)
print('Surprise', keep_Surprisex)
print('Sad', keep_Sadx)
print('Fear', keep_Fearx)
# print the first 4 tweets
all_tweets_no_urls[:4]
Result:
['Banking Marketplace Making a Wise Pivot Banking is necessary banks are not',
 'The aim is to enable people to discover whether the doctors and hospitals they visit may have motives other than pa',
 'Civilized people respect others Civilized people know the limits of their freedom of speech Your freedom ends w',
 'PayPal to allow cryptocurrency buying selling and shopping on its network']
# The result in a chart
names = ['Happy', 'Angry', 'Surprise', 'Sad', 'Fear']
values_1 = [keep_Happyx, keep_Angryx, keep_Surprisex, keep_Sadx, keep_Fearx]

# one x value per tweet for the score curves
x = list(range(len(Happylst)))

plt.figure(figsize=(30, 9))
plt.subplot(131)
plt.bar(names, values_1)
plt.title(Account_name + " Emotions")

plt.subplot(132)
plt.title(Account_name + " Emotions - Happy (green), Fear (red)")
plt.plot(x, Happylst, marker='o', color="green", markersize=2, linewidth=2, label="Happy")
plt.plot(x, Fearlst, marker='o', color="red", markersize=2, linewidth=2, label="Fear")
plt.show()

Biden tweets:

# the last 4 tweets
['RT BarackObama Dont boo vote Happy Halloween everybody',
 'President Obama and I left President Trump a playbook on how to deal with pandemics He flatout ignored it And we',
 'If youre planning to vote early inperson today is your last day to do so in many states Head to',
 'RT TeamJoe Voting Update Early vote hours have been extended in Douglas County NE to 500 PM today Find your early voting location a']

Trump tweets:

#the last 4 tweets:
['Biden & Obama owe a massive apology to the People of Flint The water was poisoned on their watch Not only did the',
 'Bidens plan to Abolish American Energy is an economic DEATH SENTENCE for Pennsylvania A vote for Biden is a vote',
 'Our ECONOMY is now surging back faster better bigger and stronger than any nation on earth We just had the best',
 'I ran for office 4 years ago because I could not sit by amp watch any longer as a small group of Washington Insiders']

Obama tweets:

#the last 4 tweets:
['Dont boo vote Happy Halloween everybody',
 'Three days Michigan Three days until the most important election of our lifetimes Join me and JoeBiden for our',
 'You could be the difference between someone making it out to the polls or staying home And many states could be de',
 'Always great talking to KingJames and MavCarter about everything from this unique NBA season to the importance of']


Holy Quran word clouds with the word2vec algorithm, Python example

How to draw similar words from the Holy Quran as a word cloud, using the word2vec vector model to compute the similar words. You can download the model and follow the steps of the program to get the same results.

Download the model:

from gensim.models import KeyedVectors
from bidi.algorithm import get_display
import arabic_reshaper
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from typing import List, Dict
# function to plot the word cloud 
def plot_word_cloud(word_list: List[str], word_frequency: Dict[str, float]):
    full_string = ' '.join(word_list)
    reshaped_text = arabic_reshaper.reshape(full_string)
    translated_text = get_display(reshaped_text)   
    # Build the Arabic word cloud
    wordc = WordCloud(font_path='tahoma',background_color='white', width=800, height=300).generate(translated_text)
    wordc.fit_words(word_frequency)
        
    # Draw the word cloud
    plt.imshow(wordc)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    
    plt.show()
# load the model
model = KeyedVectors.load('model/quran_w7_m15.bin') 
print("Model loaded")

# check the model size
print('Number of all words: ', len(model.wv.vocab))
# Enter the word you want to search
Word_to_plot = 'النهار' 

#result size 
retsize = 200
  
temp_tuple = model.most_similar(positive=[Word_to_plot], negative=[], topn = retsize)

similar_words=[i[0] for i in temp_tuple]
        
word_frequency = {}
for word_tuple in temp_tuple:
    reshaped_word = arabic_reshaper.reshape(word_tuple[0])
    key = get_display(reshaped_word)
    word_frequency[key] = word_tuple[1]     
    
 
plot_word_cloud(similar_words, word_frequency)

Result:

Try other words by re-running the same steps:

# Enter the word you want to search
Word_to_plot = 'كريم'

# result size
retsize = 200

# Enter the word you want to search
Word_to_plot = 'النار'

# result size
retsize = 200
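If you prefer to train a similar model yourself instead of downloading it, here is a minimal sketch (gensim 3.x; the corpus file name is hypothetical, and window=7 / min_count=15 are only guesses based on the model's file name):

from gensim.models import Word2Vec

# one verse per line, whitespace-tokenized
verses = [line.split() for line in open('quran.txt', encoding='utf-8')]
qmodel = Word2Vec(verses, size=200, window=7, min_count=15, workers=1)
qmodel.wv.save('model/quran_w7_m15.bin')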


Emotion from text (sadness, joy, fear, anger), Python example

under editing

Download the sample data:

Date and time: Sat Feb 22 04:14:23 +0000 2020

text: RT What a coincidence that Russia always happens to support whoever the Democrat party elites are trying to destroy at any given…

sadness: 0.15790210664272308

joy: 0.20329616963863373

fear: 0.47116225957870483

anger: 0.2204379141330719

from nltk.tokenize import word_tokenize
import pandas as pd  
import matplotlib.pyplot as plt

I used Trump tweets:
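The post is still unfinished; as a placeholder, here is a minimal sketch of how per-tweet scores in the shape shown above could be plotted, assuming four lists (one score per tweet) have already been filled — the list names are hypothetical:

import matplotlib.pyplot as plt

def plot_emotions(sadness_scores, joy_scores, fear_scores, anger_scores):
    # one line per emotion, tweet index on the x-axis
    x = range(len(sadness_scores))
    plt.figure(figsize=(12, 5))
    plt.plot(x, sadness_scores, label='sadness')
    plt.plot(x, joy_scores, label='joy')
    plt.plot(x, fear_scores, label='fear')
    plt.plot(x, anger_scores, label='anger')
    plt.xlabel('tweet index')
    plt.ylabel('score')
    plt.legend()
    plt.show()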

