LearningLiveCode: September 2015

Wednesday, September 30, 2015

Update on Using LiveCode to Build a Word Cloud: A Cloud is Forming!

I spent about two more hours on my little word cloud project. I knew that the next step of building the actual word cloud from the list of unique words and frequencies was definitely within reach. Here's an example of one of the first word clouds I built using the text of Abraham Lincoln's second inaugural address:

There is a "mild" hack at play here. The words simply go to a random spot within this square area. Each separate field containing a word has the script "grab me" on mousedown so that I can easily move the words to a more aesthetically pleasing location. I decided it wasn't worth trying to figure out ways to keep all of the words from overlapping, etc. However, for an excellent example of how to accomplish this, check out the blog posting by Ali Lloyd on his efforts to build a word cloud. Ali is one of the excellent professionals who work at RunRev (the parent company of LiveCode). Ali's solution is exactly what you think of when you think of a word cloud. It has words of different sizes and colors with different orientations filling every nook and cranny. It's really marvelous. So, many thanks to Ali for sharing this link with me in his comment to my previous blog posting on this topic. I'll be studying his code for some time to come. (And any script that uses sines and cosines makes me want to purr.)

How I Did It

OK, back to my humble attempt. One of the challenging parts to the project was figuring out the step-wise progression of font sizes. The above word cloud looks OK, but I was able to improve the word cloud algorithm in several fundamental ways. All of the code to build the word cloud is in the green button "Build Word Cloud" shown at the bottom of this post, but here are a few key highlights.

Font Size Step-Wise Progression

I perfected the step-wise progression of the font size so that the word with the highest frequency of use has a font size of 96 pixels. In the example above, the font size is directly proportional to the frequency. That was inadequate for many reasons; the most obvious is that the font sizes can be radically different for just the first few most frequently used words. So, I revised the script so that the next most frequently used word gets a font size 12 pixels smaller, no matter how many fewer times it was used, and so on. (Words with the same frequency share the same font size.) Let me explain a little further. If one word was mentioned 100 times, but the next most frequently mentioned word was used only 20 times, then my revised script would give the second word a font size only 12 pixels smaller. Think of 12 pixels as the height of the "step." From a data visualization standpoint, that skews the proportion in an inappropriate way, but it makes for a more aesthetically pleasing outcome. So, I think it all depends on what the purpose of the word cloud is. For scientific purposes, it is inadequate because it skews the output, but for a quick, pleasing visual that gives the gist of what's going on in a passage of text, it's fine. I also made 12 pixels the smallest font size that would be used. (In my original script, it was possible to have one word be 96 pixels and all remaining words 12 pixels if the most frequently used word was mentioned an inordinate number of times as compared to all other words.)
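The stepping rule is easy to try outside LiveCode, too. Here is a minimal sketch in Python (not the button script itself). The function name `step_font_sizes` and its parameters are made up for illustration, and it assumes the frequencies are already sorted from highest to lowest, as they are in the "word frequencies" field.

```python
def step_font_sizes(frequencies, max_size=96, step=12, min_size=12):
    """Map a descending list of word frequencies to font sizes.

    Each time the frequency drops, the size drops by one fixed step,
    no matter how large the drop in frequency is.
    """
    sizes = []
    steps_down = 0      # number of frequency drops seen so far
    previous = None     # frequency of the previous word
    for freq in frequencies:
        if previous is not None and freq < previous:
            steps_down += 1
        previous = freq
        # Never go below the smallest allowed font size.
        sizes.append(max(min_size, max_size - steps_down * step))
    return sizes


# 100 uses vs. 20 uses still differ by only one 12-pixel step.
print(step_font_sizes([100, 20, 20, 19, 5]))  # [96, 84, 84, 72, 60]
```

With directly proportional sizing, the second word in this example would come out at about 19 pixels (96 × 20/100) instead of 84, which is exactly the kind of radical jump the step approach avoids.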

Adding Color

I also added the option to pick a color at random using the following code:

    put random (255) into rColor  
    put random (255) into gColor  
    put random (255) into bColor


    if varColor is true then   
      set the foregroundcolor of it to rColor,gColor,bColor  
    else  
      set the foregroundcolor of it to black       
    end if

You'll need to scan the code below for these lines in the button script. The first three lines each pick a number at random from 1 to 255 (LiveCode's random(255) never returns 0). These are used to produce a random RGB color if the "Color" option is checked.
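For comparison, here is the same idea as a minimal Python sketch. The function name `pick_word_color` is hypothetical, and unlike the LiveCode lines above it draws each channel from the full 0 to 255 range.

```python
import random


def pick_word_color(use_color):
    """Return an (r, g, b) tuple: random if use_color is set, else black."""
    if use_color:
        # randint(0, 255) can return 0; LiveCode's random(255) yields 1 to 255.
        return tuple(random.randint(0, 255) for _ in range(3))
    return (0, 0, 0)


print(pick_word_color(False))  # (0, 0, 0)
print(pick_word_color(True))   # a random color; differs on every run
```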

These changes produced the following word cloud:

I know my graphic design skills don't qualify me to give any "expert" opinion, but it definitely seems like an improvement to me, at least aesthetically. Again, though, the question of whether you get a more accurate representation of the data is an important question to ask.

Another improvement was the use of the "font step" slider at the top of the screen. When I ran the algorithm with different text passages, I found that 12 was not always the optimal number for the font step. I decided it was better to let the user experiment with this. A final minor improvement, I think, is making sure that the two words with the highest frequency are always shown in black for added emphasis. Here's a screen shot of the card that builds the word cloud.

I Found Two Golden Nuggets: formattedWidth and formattedHeight

One of the wonderful outcomes of building this project is discovering the "formattedWidth" and "formattedHeight" properties. These properties do all of the hard work of figuring out exactly how wide and how tall a text field needs to be for its contents to fit perfectly within it. I didn't know about these properties when I was first building my Q Sort project, so I came up with my own function - very imperfect - to try to do the same thing. I've since updated my Q Sort app to use them. Here are the two key lines of code that accomplish this feat:

    set the width of field "word object" to the formattedWidth of field "word object"  
    set the height of field "word object" to the formattedHeight of field "word object"

Next Steps

As I looked at Ali Lloyd's code, it reminded me of Richard Gaskin's advice to me almost a year ago to use the "repeat for each" form of going through a list of data rather than the "repeat with" approach that I have become so fond of. My code that computes the frequency of the words in a passage of text is painfully slow, so I'm now very motivated to get with it and try out the "repeat for each" approach. So, look for at least one more update on this little project.

As always, the bottom line for me is that I continue to learn new things every time I build a LiveCode project, however small. But, is there really any other way?

Script on the Button "Build Word Cloud":

 on mouseUp  
   //Erase any existing word cloud first  
   put the number of fields into L  
   repeat with i = 5 to L-1  
    put i-4 into j  
    put item 1 of line j of field "word frequencies" into varFieldName  
    put varFieldName into message  
    delete field varFieldName  
   end repeat  
   put false into varColor  
   if the hilite of button "color" is true then put true into varColor  
   put the thumbposition of scrollbar "fontdifferencebar" into varFontChangeAmount  
   //Build the word cloud  
   set the movespeed to 0  
   //Determine the largest frequency - this will get the largest font size in the word cloud  
   put item 2 of line 1 of field "word frequencies" into varMaxFrequency  
   put 0 into varFontDifference  
   put field "minimum frequency" into varMinFrequency  
   //This is the repeat loop that will create each word, resize its text size, then move it  
   repeat with i = 1 to the number of lines in field "word frequencies"  
    put random (255) into rColor  
    put random (255) into gColor  
    put random (255) into bColor  
    if item 2 of line i of field "word frequencies" < varMaxFrequency then  
      add 1 to varFontDifference  
      put item 2 of line i of field "word frequencies" into varMaxFrequency  
    end if  
    if item 2 of line i of field "word frequencies" < varMinFrequency then exit repeat  
    //The next two lines determine the area of the screen where word cloud will be built  
    put random(300)+100 into x  
    put random(300)+100 into y  
    //Create the next word for the word cloud  
    copy field "word object" on card "library" to this card  
    hide it  
    if varColor is true then   
      set the foregroundcolor of it to rColor,gColor,bColor  
    else  
      set the foregroundcolor of it to black       
    end if  
    if varFontDifference<2 then set the foregroundcolor of it to black  
    put item 1 of line i of field "word frequencies" into field "word object"  
    //Determine font height for the word  
    //put varMaxFrequency - item 2 of line i of field "word frequencies" into varFontDifference  
    put 96-(varFontDifference*varFontChangeAmount) into varTextSize  
    if varTextSize < 12 then put 12 into varTextSize  
    set the textSize of field "word object" to varTextSize  
    set the width of field "word object" to the formattedWidth of field "word object"  
    set the height of field "word object" to the formattedHeight of field "word object"  
    //Rename the newly copied field as the word it contains  
    set name of field "word object" to item 1 of line i of field "word frequencies"  
    //Move the word to a random spot within the word cloud screen area  
    move it to x,y in 1 millisecond  
    show it  
   end repeat  
 end mouseUp

Postscript: About My Formatted Code

Ali Lloyd's post reminded me that the way I've been showing code in my blog posts has been really terrible, so I did the obvious thing and googled "showing code in a blog post using blogger" and quickly found a great tool:

http://codeformatter.blogspot.com/

The much improved formatting above is the result.

Sunday, September 13, 2015

Using LiveCode to Create a Word Cloud (Almost) with a Nod Toward Data Mining

OK, it was 5:30 p.m. on Friday afternoon and I was completely spent mentally from a three-hour faculty meeting. I wanted to end the work week on an upbeat note, so I decided to take a few minutes to work with LiveCode. I've spent all of my free time this week working on my Q sort tool, so I wanted to do something brand-new. So, I started a new LiveCode project that had recently been on my mind. On Thursday I had attended an excellent presentation about data mining by one of our very talented doctoral students - Neo Hao - at the Design, Development and Research Conference hosted at the University of Georgia (and chaired by Dr. Rob Branch). Neo's presentation made me think about how easy it would be to build a simple example of a data mining program: a word cloud using the frequency of words in a given passage of text. This post is only the result of about 30 minutes of work. I didn't finish the project, but I was able to build the basics and I think it is an interesting example of using the excellent list processing capabilities of LiveCode. (Interestingly, after I started writing this blog post I worked at least another hour on the program's aesthetics, such as fonts, labels, and some user feedback, in order to make the project "presentable" in a blog posting. That is the way it always is with software design - the time needed to make a program work is always much less than the time needed to make it usable.)

I'm sure you know what a word cloud is, but if not, here's an example built with wordle.net and based on titles of some of the things I've published over the past few years:

As you can see, a word cloud is simply a listing of all the unique words in the passage with a visual representation of the frequency of the words. The more times the word is used, the bigger its font size. A word cloud is an excellent visual representation of the importance of certain words in a given passage of text, and I think you can get a good, quick, snapshot of what I've been writing about in my published work just by scanning this image. Notice, by the way, the fact that inconsequential words, such as "a" and "an," "the," "and," and the like are not represented. Also notice that punctuation has been stripped out. These become important points for us to consider later on.

Now, don't get your hopes up that I'm going to show how to build a word cloud as elegant as this. In fact, all I've built so far is a little program that takes a passage of text and figures out all of the unique words and computes the number of times they are used. It also strips out all of the words you decide are inconsequential. It also provides some quick summary data, such as the total number of words in the passage of text and the total number of unique words.
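To make those steps concrete, here is a minimal sketch of the same analysis in Python rather than LiveCode. It is not a translation of my script: the function name `word_frequencies` and the `PUNCTUATION` constant are made up for illustration, it strips every leading and trailing punctuation mark in one go, and it lowercases each word, which is an assumption meant to mimic LiveCode's default case-insensitive comparisons.

```python
from collections import Counter

# Characters stripped from the start and end of each word.
PUNCTUATION = ',.;?!:"()'


def word_frequencies(text, ignore=()):
    """Count how often each word occurs, skipping ignored words.

    Returns (total_words, counts), where counts is a list of
    (word, count) pairs sorted from most to least frequent.
    """
    ignored = {w.lower() for w in ignore}
    words = [w.strip(PUNCTUATION).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    counts = Counter(w for w in words if w not in ignored)
    return len(words), counts.most_common()


total, freqs = word_frequencies(
    "The cat and the hat. The cat!", ignore=["the", "and"]
)
print(total)       # 7
print(freqs)       # [('cat', 2), ('hat', 1)]
print(len(freqs))  # 2 unique words remain after ignoring "the" and "and"
```

The counting here happens in a single pass over the words, so it stays quick even on long passages.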

Here's a screenshot of the program using Abraham Lincoln's second inaugural address:

Get My LiveCode file

[ Get the free LiveCode Community version. ]

How It Works

The card consists of the following main fields displayed left-to-right on the screen:

original - this field contains the original passage of text you want to analyze;
unique words - this field contains all unique words found in the original passage;
word frequencies - this field contains the unique words plus their frequencies;
ignore - this field (with a reddish background color) lists all words you want to be ignored when identifying unique words.

All of the code is contained within the button "Analyze." The program works with several repeating loops, which I've color coded blue and green:

    on mouseUp
      put empty into field "unique words"
      put empty into field "target word"
      put empty into field "word frequencies"
      put empty into field "unique count"
      put empty into field "frequency count"
      put the number of words in field "original" into L
      //Search for Unique Words; Remove Punctuation
      put "Finding unique words..." into field "working" //user feedback
      show field "working"
      repeat with i=1 to L
        put empty into field "target word"
        put word i of field "original" into varTargetWord
        //Strip out any punctuation found in the word
        if the last character of varTargetWord = comma then delete the last character of varTargetWord
        if the last character of varTargetWord = "." then delete the last character of varTargetWord
        if the last character of varTargetWord = ";" then delete the last character of varTargetWord
        if the last character of varTargetWord = "?" then delete the last character of varTargetWord
        if the last character of varTargetWord = "!" then delete the last character of varTargetWord
        if the last character of varTargetWord = ":" then delete the last character of varTargetWord
        if the last character of varTargetWord = quote then delete the last character of varTargetWord
        if the first character of varTargetWord = quote then delete the first character of varTargetWord
        if the first character of varTargetWord = "(" then delete the first character of varTargetWord
        if the last character of varTargetWord = ")" then delete the last character of varTargetWord
        //Check to see if the word has already been found
        put the number of lines in field "unique words" into LL
        put true into varUniqueWordFound
        repeat with j=1 to LL
          if varTargetWord = word j of field "unique words" then
            put false into varUniqueWordFound
            exit repeat
          end if
        end repeat
        if varUniqueWordFound is true then put varTargetWord&return after field "unique words"
      end repeat
      //Compute the frequencies of each unique word found
      put "Computing frequencies..." into field "working" //user feedback
      put the number of lines in field "unique words" into L
      repeat with i = 1 to L
        put line i of field "unique words" into varUniqueWord
        //First, look for inconsequential words and strip out
        put the number of lines in field "ignore" into LLL
        put false into varWordToIgnoreFound
        repeat with k= 1 to LLL
          put line k of field "ignore" into varWordToIgnore
          if varUniqueWord=varWordToIgnore then
            put true into varWordToIgnoreFound
            exit repeat
          end if
        end repeat
        if varWordToIgnoreFound is true then next repeat
        //Ok, begin counting the unique words
        put 0 into varCount
        put the number of words in field "original" into LL
        repeat with j=1 to LL
          put word j of field "original" into varTargetWord