Sunday, September 13, 2015

Using LiveCode to Create a Word Cloud (Almost) with a Nod Toward Data Mining

OK, it was 5:30 on Friday afternoon and I was completely spent mentally from a three-hour faculty meeting. I wanted to end the work week on an upbeat note, so I decided to take a few minutes to work with LiveCode. I've spent all of my free time this week working on my Q sort tool, so I wanted to do something brand-new. So, I started a new LiveCode project that had recently been on my mind. On Thursday I had attended an excellent presentation about data mining by one of our very talented doctoral students - Neo Hao - at the Design, Development and Research Conference hosted at the University of Georgia (and chaired by Dr. Rob Branch). Neo's presentation made me think about how easy it would be to build a simple example of a data mining program: a word cloud using the frequency of words in a given passage of text. This post is the result of only about 30 minutes of work. I didn't finish the project, but I was able to build the basics and I think it is an interesting example of using the excellent list processing capabilities of LiveCode. (Interestingly, after I started writing this post I worked at least another hour on the program's aesthetic stuff, like fonts, labels, and some user feedback, in order to make the project "presentable" in a blog posting. That is the way it always is with software design - the time needed to make a program work is always much less than the time needed to make it usable.)

I'm sure you know what a word cloud is, but if not, here's an example built with wordle.net and based on titles of some of the things I've published over the past few years:



As you can see, a word cloud is simply a listing of all the unique words in the passage with a visual representation of the frequency of the words. The more times the word is used, the bigger its font size. A word cloud is an excellent visual representation of the importance of certain words in a given passage of text, and I think you can get a good, quick snapshot of what I've been writing about in my published work just by scanning this image. Notice, by the way, that inconsequential words, such as "a," "an," "the," "and," and the like, are not represented. Also notice that punctuation has been stripped out. These become important points for us to consider later on.

Now, don't get your hopes up that I'm going to show how to build a word cloud as elegant as this. In fact, all I've built so far is a little program that takes a passage of text, figures out all of the unique words, and computes the number of times each one is used. It also strips out all of the words you decide are inconsequential and provides some quick summary data, such as the total number of words in the passage of text and the total number of unique words.

Here's a screenshot of the program using Abraham Lincoln's second inaugural address:

[ Get the free LiveCode Community version. ]



How It Works


The card consists of the following main fields displayed left-to-right on the screen:
  • original - this field contains the original passage of text you want to analyze;
  • unique words - this field contains all unique words found in the original passage;
  • word frequencies - this field contains the unique words plus their frequencies;
  • ignore - this field (with a reddish background color) lists all words you want to be ignored when identifying unique words.
All of the code is contained within the button "Analyze." The program works with several repeating loops, which I've color coded blue and green:

on mouseUp
   put empty into field "unique words"
   put empty into field "target word"
   put empty into field "word frequencies"
   put empty into field "unique count"

   put empty into field "frequency count"
   put the number of words in field "original" into L
   
   //Search for Unique Words; Remove Punctuation
   put "Finding unique words..." into field "working"  //user feedback
   show field "working"
   repeat with i=1 to L
      put empty into field "target word"
      put word i of field "original" into varTargetWord
      //Strip out any punctuation found in the word
      if the last character of varTargetWord = comma then delete the last character of varTargetWord
      if the last character of varTargetWord = "." then delete the last character of varTargetWord
      if the last character of varTargetWord = ";" then delete the last character of varTargetWord
      if the last character of varTargetWord = "?" then delete the last character of varTargetWord
      if the last character of varTargetWord = "!" then delete the last character of varTargetWord
      if the last character of varTargetWord = ":" then delete the last character of varTargetWord
      if the last character of varTargetWord = quote then delete the last character of varTargetWord
      if the first character of varTargetWord = quote then delete the first character of varTargetWord
      if the first character of varTargetWord = "(" then delete the first character of varTargetWord
      if the last character of varTargetWord = ")" then delete the last character of varTargetWord

      //Check to see if the word has already been found
      put the number of lines in field "unique words" into LL
      put true into varUniqueWordFound
      repeat with j=1 to LL         
         if varTargetWord = word j of field "unique words" then
            put false into varUniqueWordFound
            exit repeat            
         end if
      end repeat
      if varUniqueWordFound is true then put varTargetWord&return after field "unique words"
   end repeat
   
   //Compute the frequencies of each unique word found
   put "Computing frequencies..." into field "working" //user feedback
   put the number of lines in field "unique words" into L
   repeat with i = 1 to L
      put line i of field "unique words" into varUniqueWord
      //First, look for inconsequential words and strip out
      put the number of lines in field "ignore" into LLL
      put false into varWordToIgnoreFound
      repeat with k= 1 to LLL
         put line k of field "ignore" into varWordToIgnore
         if varUniqueWord=varWordToIgnore then 
            put true into varWordToIgnoreFound
            exit repeat
         end if
      end repeat
      if varWordToIgnoreFound is true then next repeat
      //Ok, begin counting the unique words
      put 0 into varCount
      put the number of words in field "original" into LL
      repeat with j=1 to LL
         put word j of field "original" into varTargetWord         
         //Strip out any punctuation found in the word
         if the last character of varTargetWord = comma then delete the last character of varTargetWord
         if the last character of varTargetWord = "." then delete the last character of varTargetWord
         if the last character of varTargetWord = ";" then delete the last character of varTargetWord
         if the last character of varTargetWord = "?" then delete the last character of varTargetWord
         if the last character of varTargetWord = "!" then delete the last character of varTargetWord
         if the last character of varTargetWord = ":" then delete the last character of varTargetWord
         if the last character of varTargetWord = quote then delete the last character of varTargetWord
         if the first character of varTargetWord = quote then delete the first character of varTargetWord
         if the first character of varTargetWord = "(" then delete the first character of varTargetWord
         if the last character of varTargetWord = ")" then delete the last character of varTargetWord
         if varTargetWord=varUniqueWord then add 1 to varCount
      end repeat
      put varUniqueWord&comma&varCount&return after field "word frequencies"
   end repeat
   
   put "Sorting..." into field "working" //user feedback
   sort lines of field "word frequencies" numeric descending by item 2 of each
   
   put "Done!" into field "working" //user feedback
   wait .5 seconds
   hide field "working"
end mouseUp


The first repeat loop, color-coded light blue, searches for unique words while removing punctuation. It looks at each word in the passage (in the field "original") and checks to see if the word has already been found (in the field "unique words"). To do this, a second loop (shown in darker blue) is executed that looks at each word in the field "unique words." If the word has already been found, the line "exit repeat" is executed and the program stops looking through the unique words already found and goes to the next word in the original passage. It's worth pausing for a moment to study this second loop. As my experience with LiveCode grows, I find interesting patterns in my coding.

The question of "uniqueness" in a given text comes up over and over. Likewise, the strategy that I came up with long ago to identify uniqueness - as shown in this dark blue code - has become a time-tested friend to me. Here's a short explanation:

I begin by creating a true/false variable; in this case I set the variable "varUniqueWordFound" to true. By setting it to true at the start, I'm saying that I'm going to assume that the next word I look at is unique, that is, I expect that it has not already been found somewhere previously in the text. In a sense, I'm "daring" the list of unique words to prove me wrong. I then do a search of all unique words previously found. Obviously, at the start there are zero words found. Then, as unique words are found, they are added to the text field "unique words." The first loop identifies the words in the original text passage - denoted by the phrase "word i" in the line found near the top of the first loop (in light blue):

put word i of field "original" into varTargetWord

So, the variable "varTargetWord" contains this ever-changing target word. The second loop compares the target word to each and every unique word already found. If a match is found (at any point) two things immediately happen: 1) I put false into the variable "varUniqueWordFound" because the target word has been seen before and so is not a new unique word; and 2) I exit the current repeat, which in this case is the second loop (in dark blue). If no match is found, the variable "varUniqueWordFound" remains true, which kicks in the last line of code shown in dark blue: 

if varUniqueWordFound is true then put varTargetWord&return after field "unique words"

That line instructs the program to add the current target word to the field of unique words. The first loop (in light blue) then moves on to the next word in the passage and the operation is repeated.
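
The flag pattern described above can also be packaged as a small function. What follows is only a minimal sketch, not the code from the "Analyze" button: the function name wordIsNew and its parameter names are made up for illustration, and it assumes the list of found words holds one word per line, as the field "unique words" does.

```livecode
-- Hypothetical helper (not part of the original script).
-- Returns true if pWord does not yet appear in pFoundList,
-- a return-delimited list with one word per line.
function wordIsNew pWord, pFoundList
   local tIsNew
   put true into tIsNew  -- assume the word is new until a match proves otherwise
   repeat for each line tFound in pFoundList
      if tFound = pWord then
         put false into tIsNew
         exit repeat  -- one match is enough, so stop looking
      end if
   end repeat
   return tIsNew
end wordIsNew
```

Inside the first loop it could be called like this: if wordIsNew(varTargetWord, field "unique words") then put varTargetWord & return after field "unique words". The comparison uses LiveCode's default case-insensitive matching, just as the original loop does.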

Calculating Frequency


OK, let's move on. The program now counts up how many times each unique word is used in the original passage. This code is shown in green. As it goes, it also checks to see which words it should ignore as inconsequential, as listed in the field "ignore." (And you need to be careful about what you consider to be an inconsequential word, a point I'll return to at the end of this post.) Every time it finds that word in the original passage, it adds 1 to the variable varCount. As it checks each word in the original passage, it must again strip out any punctuation found with the word. So, I just repeat the code for this operation used above. (Any time you do something more than once, that is an indication that you really should create a custom function, which I would do if I weren't so lazy.) After it goes through each and every word in the original passage, it puts the unique word and the frequency in the third field "word frequencies." This step takes the most time for the program to execute.
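
For readers who want to try the custom function idea, here is one possible sketch. The names stripPunctuation and countWords are hypothetical. Note two deliberate differences from the script above: this version keeps trimming until no listed punctuation mark is left at either end of the word (the original removes at most one character per check), and the counting uses an array in a single pass rather than the nested loops. Words are lowercased before counting so that "Both" and "both" are tallied together; that is an assumption of this sketch, and it does not handle the "ignore" list.

```livecode
-- Hypothetical helper: trims the punctuation marks the script checks for
-- from both ends of a word.
function stripPunctuation pWord
   local tLeading, tTrailing
   put quote & "(" into tLeading
   put ",.;?!:)" & quote into tTrailing
   repeat while pWord is not empty and the first char of pWord is in tLeading
      delete the first char of pWord
   end repeat
   repeat while pWord is not empty and the last char of pWord is in tTrailing
      delete the last char of pWord
   end repeat
   return pWord
end stripPunctuation

-- Hypothetical single-pass counter: returns one "word,count" line per
-- unique word, sorted from most to least frequent.
function countWords pText
   local tCounts, tClean
   repeat for each word tWord in pText
      put toLower(stripPunctuation(tWord)) into tClean
      if tClean is not empty then add 1 to tCounts[tClean]
   end repeat
   combine tCounts using return and comma  -- array becomes "word,count" lines
   sort lines of tCounts descending numeric by item 2 of each
   return tCounts
end countWords
```

A call such as put countWords(field "original") into field "word frequencies" would fill the third field in one step. Because each word of the passage is visited only once, this approach should run noticeably faster on long passages than counting inside nested loops.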

Finally, I sort the field "word frequencies" in order of how often the word appears, from most to least (this code is shown in red). However, I added some buttons that give the user the option of also sorting the text alphabetically.
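
The scripts for those sort buttons are not shown in this post, but an alphabetical sort button might look something like the sketch below. It is an assumption about how such a button could be written, not the actual button script.

```livecode
-- Hypothetical script for a button that sorts the results alphabetically.
on mouseUp
   -- item 1 of each line is the word; item 2 is its count
   sort lines of field "word frequencies" ascending text by item 1 of each
end mouseUp
```

Sorting by item 1 as text gives the alphabetical listing, while sorting by item 2 as a number (as the "Analyze" script does) gives the frequency listing.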

Along the way, I add some feedback to let the user know how things are progressing.

Next Steps


The program above generates a list of all unique words in any given passage of text and the frequency that they appear. This is the raw material needed to create a word cloud. Although I'm not sure if I'll have time (or inclination) to work further on this project, here are some ideas on what needs to be done next.

Each unique word needs to be put into an object, such as a button or a field. The text size for that object would depend on the frequency of the word. There would have to be a given maximum font size, let's say 24 pt. Whatever is the most frequent word or words would be given that maximum font size. On the other side of the spectrum, all words with a frequency of 1 would be given the minimum font size, which we'll say is 10. Or, one might decide that only words that have been mentioned more than once or twice get displayed in the word cloud. (Lincoln only mentioned the word "Bible" once, but I think its one mention is very significant.) Either way, the smallest font size would be determined for a group of words. Then, all remaining words would be given one of the other levels of available font size, depending on their frequency. So, some algorithm would need to be devised that divides up all remaining words along this font size spectrum, which would not be hard to do.
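
One simple way to divide words along the font size spectrum is linear interpolation between the minimum and maximum sizes. The function below is only a sketch of that idea; the name fontSizeForCount and its parameters are hypothetical, and other schemes (such as a handful of fixed size levels) would work just as well.

```livecode
-- Hypothetical helper: maps a word's frequency onto a font size.
-- A count of 1 gets pMinSize, the highest count gets pMaxSize,
-- and everything in between is scaled proportionally.
function fontSizeForCount pCount, pMaxCount, pMinSize, pMaxSize
   if pMaxCount <= 1 then return pMinSize  -- every word appears only once
   return round(pMinSize + (pCount - 1) / (pMaxCount - 1) * (pMaxSize - pMinSize))
end fontSizeForCount
```

With Lincoln's address, where the most frequent word appears 11 times, fontSizeForCount(6, 11, 10, 24) returns 17, so "we" would be drawn at 17 pt. The result could then be applied with something like set the textSize of field "someWord" to that value, where "someWord" stands for whatever object holds the word.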

The tricky part would be to arrange these objects in some sort of pleasant visual display. Frankly, I'm not sure how I would do that!

So, feel free to download my code to this project and give it a try. (And don't forget to send me a copy.)

So, What's the Connection to Data Mining?


This little Friday afternoon project is a simple example of data mining in that it analyzes each and every word in a given passage and reports some statistics on that passage in a way that I, as a mere human, could not do effectively on my own. It's easy to imagine more sophisticated things you might investigate. Perhaps you are interested in the presence of certain words or combinations of words in a passage of text. For example, I find it very interesting that Abraham Lincoln only used the word "I" once in his second inaugural. (This fact makes it clear that it would be unwise to assume that all small words are inconsequential.) A teacher may find it useful and revealing to analyze essays on a specific topic submitted by all students in a class. You might want to see if some key words are mentioned, where they are in the essay, or how far apart they are in the essay. This is still not a substitute for actually reading the essays, but it's easy for me to see how analyzing even this small mountain of text with cleverly written algorithms would aid the teacher's overall assessment of the class's writing. But, what I think is the most important characteristic of data mining is that the computer will happily do this analyzing for thousands upon thousands of text passages quickly and errorlessly. It will reveal patterns that may subsequently uncover meaning if the algorithm is appropriate and the person interpreting the data is astute.
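
As a small illustration of the "where are the key words" idea, the sketch below reports the word positions at which a key word occurs. The function name keywordPositions is hypothetical, and the sketch assumes punctuation has already been stripped from the text, since it compares whole words exactly.

```livecode
-- Hypothetical helper: returns a comma-separated list of the word
-- positions at which pKeyWord occurs in pText.
function keywordPositions pText, pKeyWord
   local tPositions, tIndex
   put 0 into tIndex
   repeat for each word tWord in pText
      add 1 to tIndex
      if tWord = pKeyWord then put tIndex & comma after tPositions
   end repeat
   -- drop the trailing comma, if any positions were found
   if tPositions is not empty then delete the last char of tPositions
   return tPositions
end keywordPositions
```

Subtracting one item of the result from the next gives the distance, in words, between two mentions of the key word.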

And, as Neo explained in his talk on Thursday, because of public APIs we all have access to mountains of data from our Twitter or Facebook feeds. The companies themselves have access to all of these data and you can be sure those companies are mining their data with extraordinary precision. Is their intent noble or malevolent? I really don't know, but it's easy to speculate that it is somewhere in the middle. I find it uplifting that Google could track the spread of flu in almost real-time from the search data of people who were feeling ill, and not in weeks as is needed by the Centers for Disease Control (see the book Big Data by Mayer-Schönberger and Cukier). I think it is time for rank-and-file educators to be part of this conversation.

Appendix: Frequency of Unique Words in Abraham Lincoln's Second Inaugural Address



Sorted by Frequency Sorted Alphabetically
war,11 absorbs,1
all,10 accept,1
we,6 achieve,1
but,5 address,2
God,5 against,1
His,5 agents,1
shall,5 ago,2
than,4 aid,1
years,4 all,10
Union,4 Almighty,1
Both,4 already,1
let,4 altogether,2
do,4 always,1
were,3 American,1
would,3 among,1
other,3 another,1
interest,3 answered,2
right,3 anticipated,1
Neither,3 anxiously,1
has,3 any,2
may,3 appearing,1
us,3 appointed,1
Woe,3 arms,1
offenses,3 ascribe,1
must,3 ask,1
those,3 assistance,1
there,2 astounding,1
less,2 attained,1
occasion,2 attention,1
address,2 attributes,1
Now,2 avert,1
four,2 away,1
public,2 battle,1
have,2 because,1
been,2 been,2
every,2 before,1
still,2 being,1
nation,2 believers,1
could,2 Bible,1
hope,2 bind,1
no,2 blood,1
ago,2 bondsman's,1
While,2 borne,1
altogether,2 Both,4
without,2 bread,1
one,2 but,5
rather,2 called,1
came,2 came,2
slaves,2 care,1
cause,2 cause,2
even,2 cease,2
conflict,2 charity,1
cease,2 cherish,1
should,2 chiefly,1
Each,2 city,1
same,2 civil,1
pray,2 claimed,1
any,2 colored,1
just,2 come,2
answered,2 cometh,1
needs,2 conflict,2
come,2 constantly,1
whom,2 constituted,1
offense,2 contest,1
If,2 continue,1
He,2 continued,1
wills,2 corresponding,1
gives,2 could,2
Him,2 course,1
until,2 dare,1
drawn,2 declarations,1
said,2 delivered,1
second,1 departure,1
appearing,1 depends,1
take,1 deprecated,1
oath,1 destroy,1
Presidential,1 detail,1
office,1 devoted,1
extended,1 directed,1
first,1 discern,1
statement,1 dissolve,1
somewhat,1 distributed,1
detail,1 divide,1
course,1 divine,1
pursued,1 do,4
seemed,1 drawn,2
fitting,1 dreaded,1
proper,1 drop,1
expiration,1 due,1
during,1 duration,1
declarations,1 during,1
constantly,1 Each,2
called,1 easier,1
forth,1 effects,1
point,1 else,1
phase,1 encouraging,1
great,1 energies,1
contest,1 engrosses,1
absorbs,1 enlargement,1
attention,1 even,2
engrosses,1 every,2
energies,1 expected,1
little,1 expiration,1
new,1 extend,1
presented,1 extended,1
progress,1 faces,1
our,1 fervently,1
arms,1 fifty,1
upon,1 finish,1
else,1 firmness,1
chiefly,1 first,1
depends,1 fitting,1
well,1 Fondly,1
known,1 forth,1
myself,1 four,2
I,1 fully,1
trust,1 fundamental,1
reasonably,1 future,1
satisfactory,1 generally,1
encouraging,1 gives,2
high,1 God,5
future,1 God's,1
prediction,1 Government,1
regard,1 great,1
ventured,1 has,3
corresponding,1 have,2
thoughts,1 having,1
anxiously,1 He,2
directed,1 high,1
impending,1 Him,2
civil,1 His,5
dreaded,1 hope,2
sought,1 hundred,1
avert,1 I,1
inaugural,1 If,2
being,1 impending,1
delivered,1 inaugural,1
place,1 insurgent,1
devoted,1 insurgents,1
saving,1 interest,3
insurgent,1 invokes,1
agents,1 itself,1
city,1 judge,1
seeking,1 judged,1
destroy,1 judgements,1
war—seeking,1 just,2
dissolve,1 knew,1
divide,1 known,1
effects,1 lash,1
negotiation,1 lasting,1
parties,1 less,2
deprecated,1 let,4
them,1 little,1
make,1 living,1
survive,1 localized,1
accept,1 looked,1
perish,1 Lord,1
One-eighth,1 magnitude,1
whole,1 make,1
population,1 malice,1
colored,1 man,1
distributed,1 may,3
generally,1 men,1
over,1 men's,1
localized,1 might,1
southern,1 mighty,1
part,1 more,1
constituted,1 must,3
peculiar,1 myself,1
powerful,1 nation,2
knew,1 nation's,1
somehow,1 nations,1
strengthen,1 needs,2
perpetuate,1 negotiation,1
extend,1 Neither,3
object,1 new,1
insurgents,1 no,2
rend,1 none,1
Government,1 North,1
claimed,1 Now,2
more,1 oath,1
restrict,1 object,1
territorial,1 occasion,2
enlargement,1 offense,2
party,1 offenses,3
expected,1 office,1
magnitude,1 one,2
duration,1 One-eighth,1
already,1 orphan,1
attained,1 other,3
anticipated,1 our,1
might,1 ourselves,1
before,1 over,1
itself,1 own,1
looked,1 paid,1
easier,1 part,1
triumph,1 parties,1
result,1 party,1
fundamental,1 pass,1
astounding,1 peace,1
read,1 peculiar,1
Bible,1 perish,1
invokes,1 perpetuate,1
aid,1 phase,1
against,1 piled,1
seem,1 place,1
strange,1 point,1
men,1 population,1
dare,1 powerful,1
ask,1 pray,2
God's,1 prayers,1
assistance,1 prediction,1
wringing,1 presented,1
bread,1 Presidential,1
sweat,1 progress,1
men's,1 proper,1
faces,1 providence,1
judge,1 public,2
judged,1 purposes,1
prayers,1 pursued,1
fully,1 rather,2
Almighty,1 read,1
own,1 reasonably,1
purposes,1 regard,1
unto,1 remove,1
world,1 rend,1
because,1 restrict,1
man,1 result,1
cometh,1 right,3
suppose,1 righteous,1
American,1 said,2
slavery,1 same,2
providence,1 satisfactory,1
having,1 saving,1
continued,1 scourge,1
through,1 second,1
appointed,1 see,1
time,1 seeking,1
remove,1 seem,1
North,1 seemed,1
South,1 shall,5
terrible,1 should,2
due,1 slavery,1
discern,1 slaves,2
therein,1 somehow,1
departure,1 somewhat,1
divine,1 sought,1
attributes,1 South,1
believers,1 southern,1
living,1 speedily,1
always,1 statement,1
ascribe,1 still,2
Fondly,1 strange,1
fervently,1 strengthen,1
mighty,1 strive,1
scourge,1 sunk,1
speedily,1 suppose,1
pass,1 survive,1
away,1 sweat,1
continue,1 sword,1
wealth,1 take,1
piled,1 terrible,1
bondsman's,1 territorial,1
two,1 than,4
hundred,1 them,1
fifty,1 there,2
unrequited,1 therein,1
toil,1 those,3
sunk,1 thoughts,1
drop,1 thousand,1
blood,1 three,1
lash,1 through,1
paid,1 time,1
another,1 toil,1
sword,1 toward,1
three,1 triumph,1
thousand,1 true,1
judgements,1 trust,1
Lord,1 two,1
true,1 Union,4
righteous,1 unrequited,1
malice,1 until,2
toward,1 unto,1
none,1 up,1
charity,1 upon,1
firmness,1 us,3
see,1 ventured,1
strive,1 war,11
finish,1 war—seeking,1
work,1 we,6
bind,1 wealth,1
up,1 well,1
nation's,1 were,3
wounds,1 While,2
care,1 who,1
who,1 whole,1
borne,1 whom,2
battle,1 widow,1
widow,1 wills,2
orphan,1 without,2
achieve,1 Woe,3
cherish,1 work,1
lasting,1 world,1
peace,1 would,3
among,1 wounds,1
ourselves,1 wringing,1
nations,1 years,4







3 comments:

  1. Thanks for sharing! It's helpful. I agree that you should have created a function to get rid of punctuations. I also find that there are other words which are better ignored than kept, such as all, but, shall, has, had... They do not convey any substantial meaning. BTW the subject in Lincoln's speech is pretty much obvious, from the most frequent word "war". And you don't see the word "civil" which was put together with "war" quite often by the later generations. Thus I would assume that Lincoln did not consider the Confederate as part of the States back then?

  2. Hi Lloyd,
    Just stumbled upon this blog. I did something similar almost exactly a year ago - check out my blog post: https://livecode.com/head-in-word-clouds/

    Feel free to plunder that code to make word clouds from your data!

    Ali Lloyd

  3. Thanks, Ali, for sharing your blog post with me. I worked on creating the actual visual part of the word cloud from the word count frequencies and am about to post an update on my efforts. However, I checked out your word cloud algorithm briefly and think it is excellent - a much better and much more sophisticated approach. I will point people to your blog post in my soon to be published update. Thanks again - Lloyd
