I'm sure you know what a word cloud is, but if not, here's an example built with wordle.net and based on titles of some of the things I've published over the past few years:
As you can see, a word cloud is simply a listing of all the unique words in the passage with a visual representation of the frequency of the words. The more times the word is used, the bigger its font size. A word cloud is an excellent visual representation of the importance of certain words in a given passage of text, and I think you can get a good, quick, snapshot of what I've been writing about in my published work just by scanning this image. Notice, by the way, the fact that inconsequential words, such as "a" and "an," "the," "and," and the like are not represented. Also notice that punctuation has been stripped out. These become important points for us to consider later on.
Now, don't get your hopes up that I'm going to show how to build a word cloud as elegant as this. In fact, all I've built so far is a little program that takes a passage of text and figures out all of the unique words and computes the number of times they are used. It also strips out all of the words you decide are inconsequential. It also provides some quick summary data, such as the total number of words in the passage of text and the total number of unique words.
Here's a screenshot of the program using Abraham Lincoln's second inaugural address:
How It Works
The card consists of the following main fields displayed left-to-right on the screen:
- original - this field contains the original passage of text you want to analyze;
- unique words - this field contains all unique words found in the original passage;
- word frequencies - this field contains the unique words plus their frequencies
- ignore - this field (with a reddish background color) lists all words you want to be ignored when identifying unique words.
put empty into field "unique count"
put empty into field "frequency count"
The question of "uniqueness" in a given text comes up over and over. Likewise, the strategy that I came up with long ago to identify uniqueness - as shown in this dark blue code - has become a time-tested friend to me. Here's a short explanation:
I begin by creating a true/false variable, in this case I set the variable "varUniqueWordFound" to false. By setting it false at the start, I'm saying that I'm going to assume that the next word I look at is not unique, that is, I expect it to have already been found somewhere previously in the text. In a sense, I'm "daring" the text to prove me wrong. I then do a search of all unique words previously found. Obviously, at the start there are zero words found. Then, as unique words are found, they are added to the text field "unique words." The first loop identifies the words in the original text passage - denoted by the phrase "word i" in the line found near the top of the first loop (in light blue):
OK, let's move on. The program now counts up how many times each unique word is used in the original passage. This code is shown in green. As it goes, it also checks to see which words it should ignore as inconsequential, as listed in the field "ignore." (And you need to be careful about what you consider to be an inconsequential word, a point I'll return to at the end of this post.) Every time it finds that word in the original passage, it adds 1 to the variable varCount. As it checks each word in the original passage, it must again strip out any punctuation found with the word. So, I just repeat the code for this operation used above. (Any time you do something more than once, that is an indication that you really should create a custom function, which I really should do if I wasn't so lazy.) After it goes through each and every word in the original passage, it puts the unique word and the frequency in the third field "word frequencies." This step takes the most time for the program to execute.
So, What's the Connection to Data Mining?
This little Friday afternoon project is a simple example of data mining in that it is analyzes each and every word in a given passage and reports some statistics on that passage in a way that I, as a mere human, could not do effectively on my own. It's easy to imagine more sophisticated things you might investigate. Perhaps you are interested in the presence of certain words or combinations of words in a passage of text. For example, I find it very interesting that Abraham Lincoln only used the word "I" once in his second inaugural. (This fact makes it clear that it would be unwise to assume that all small words are inconsequential.) A teacher may find it useful and revealing to analyze essays on a specific topic submitted by all students in a class. You might want to see if some key words are mentioned, where they are in the essay, or how far apart they are in the essay. This is still not a substitute for actually reading the essays, but it's easy for me to see how analyzing even this small mountain of text with cleverly written algorithms would aid the teacher's overall assessment of the class's writing. But, what I think is the most important characteristic of data mining is that the computer will happily do this analyzing for thousands upon thousands of text passages quickly and errorlessly. It will reveal patterns that may subsequently uncover meaning if the algorithm is appropriate and the person interpreting the data is astute.
And, as Neo explained in his talk on Thursday, because of public APIs we all have access to mountains of data from our twitter or facebook feeds. The companies themselves have access to all of these data and you can be sure those companies are mining their data with extraordinary precision. Is their intent noble or malevolent? I really don't know, but it's easy to speculate that it is somewhere in the middle. I find it uplifting that Google could track the spread of flu in almost real-time from the search data of people who were feeling ill, and not in weeks as is needed by the Centers for Disease Control (see the book Big Data by Mayer-Schonberger and Cukier). I think it is time for rank-and-file educators to be part of this conversation.
Appendix: Frequency of Unique Words in Abraham Lincoln's Second Inaugural Address
|Sorted by Frequency||Sorted Alphabetically|