Unexpected Challenges Result in Important and Informative Discussions: A Transparent Conversation about Stripping Content and Stopwords

As described in previous posts, the first-year Digital Fellows at CHNM have been working on a project under the Research division that involves collecting, cleaning, and analyzing data from a corpus of THATCamp content. Having overcome the hurdles of writing some Python scripts and using MySQL to grab content from tables in the backend of a WordPress install, we moved on to the relatively straightforward process of running our stripped text files through MALLET.
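For anyone who wants to try something similar, a minimal sketch of the MALLET step might look like the following. The directory name, output file names, and topic count are placeholders rather than the exact settings we used, and it assumes the mallet executable is on your PATH.

```python
import subprocess

# Convert a directory of plain-text files into MALLET's binary format.
# --keep-sequence preserves word order; --remove-stopwords applies
# MALLET's default English stoplist.
subprocess.run([
    "mallet", "import-dir",
    "--input", "stripped_texts",
    "--output", "thatcamp.mallet",
    "--keep-sequence",
    "--remove-stopwords",
], check=True)

# Train a topic model and write out the topic keys (the word lists we
# examine below) and the per-document topic proportions.
subprocess.run([
    "mallet", "train-topics",
    "--input", "thatcamp.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
], check=True)
```

The --remove-stopwords flag is what applies MALLET’s default English stoplist, which is where the discussion below picks up.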

As we opened the MALLET output files, excited to see the topic models it produced, we were confronted with a problem we hadn’t reasonably anticipated, and it turned into a rather important discussion about data and meaning.

As a bit of background: topic modeling involves a process of filtering “stopwords” from a data set. A stoplist frequently includes function words, or terms that appear repeatedly in discourse, like “a,” “an,” and “the.” These are filtered out because they serve a grammatical purpose but have little lexical meaning. Errors, misspellings, and leftover lines of code that slipped through the previous steps can also be filtered out at this stage.
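To make that concrete, here is a toy illustration of what stopword filtering does (the sentence and the tiny stoplist are invented; MALLET handles this internally when its stoplist options are turned on):

```python
# A toy stoplist of English function words; real stoplists contain
# several hundred entries.
stopwords = {"a", "an", "the", "of", "and", "is", "to", "in"}

tokens = "the history of the digital humanities is a history of collaboration".split()

# Keep only the tokens that carry lexical meaning.
content_words = [t for t in tokens if t not in stopwords]
print(content_words)
# ['history', 'digital', 'humanities', 'history', 'collaboration']
```

MALLET does the same thing at import time, just with a much longer list.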

As we opened the file of keys produced by MALLET, we found terms that raised questions about what should or should not be included in our analysis. In particular, the discussion centered on spelling errors and on function words in Spanish and French.

The conversation that followed, reproduced below, was significant, and as people look through the results of this project or consider their own efforts to reproduce something like this elsewhere, we’d like to be transparent about the decisions we made and, perhaps, spur a discussion about how to address scenarios like this in the future.

_____________________________________________

Take a look at what MALLET spit out: there are some errors.
Stuff like “xe, xc, zijn, xb, en, la”.

Yeah, I saw that.
We can make a custom stoplist to remove those.
Make a list and we’ll add it.

Interesting keys though – did you see that “women” came up in #0?
I’m excited to see this once it’s all graphed.

There’s some stuff though that I’m not sure we want to remove: Is CAA an error or an abbreviation? “socal” – is that social misspelled or Southern California, abbreviated?
Hmm, that could be an organization…

Yeah, SoCal, as in Southern California.
There was a camp there.
This raises a larger question: do we remove misspellings?
For clarity?

I have mixed feelings.

Me too.

I think it’s appropriate to remove backend stuff:
tags and metadata, but content is not something we should modify.

I agree.
We don’t want to skew the results.
Some of it occurred when we stripped out everything that wasn’t alphanumeric.
It took out apostrophes, causing words like “I’ve” to become “I ve”.

The errors themselves are telling about the nature of THATCamps.
That the content is generated spontaneously
lends itself to deviations from standard spelling, etc.

I agree

Look at #17 and 18.

Whoa, where did “humanidades” come from?
Oh, right! There are international conference posts in here too!

How do we handle this?
It’s possible to strip out the camps that are not in English,
or even to run the analysis on them separately.
I don’t want to skew the results, but this also throws things off.

I know.
I say we leave it. It shows a growing international influence.
We’ll be able to see the emergence of International THATCamps.

I’ve never run into this before.
It brings up some interesting issues.
I wonder if there is a standard procedure for something like this.

What about things like “en”, which is Spanish for “in”?
Its English equivalent would have been removed by the English stoplist.
And now function words in Spanish and French
seem to appear more frequently because
the English terms have been filtered out.
How do you do topic modeling with multiple languages?

What about special characters?
We’ve stripped stuff out; how might that affect the appearance of words?

We have to find a stoplist for each of the languages
to strip out the function words in all of them.

Good call. This got complicated quick!

Agreed.

_____________________________________________

As outlined above, opening the text file of keys raised new questions about the relevance and complications of running a particular stoplist on a corpus of texts. Similarly, we were forced to rethink how we handle misspellings and unfamiliar abbreviations. In the end, we tracked down stoplists for Spanish and French, so that function words in any of these languages would not skew the results of our analysis. We also carefully examined the keys to identify abbreviations and misspellings and decided that they make a significant contribution to the analysis.
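If it helps anyone reproduce this step, one approach is to merge the individual language stoplists into a single file and hand that to MALLET, either with --extra-stopwords (which supplements the default English list) or --stoplist-file (which replaces it). This is only a sketch: it assumes plain-text stoplists named en.txt, es.txt, and fr.txt with one word per line, like those bundled with MALLET or available online.

```python
# Merge one-word-per-line stoplists for several languages into a single
# file that can be passed to MALLET with --extra-stopwords or
# --stoplist-file. The file names here are placeholders.
stoplist_files = ["en.txt", "es.txt", "fr.txt"]

merged = set()
for path in stoplist_files:
    with open(path, encoding="utf-8") as f:
        merged.update(word.strip().lower() for word in f if word.strip())

with open("stoplist_multilingual.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(merged)))
```

The merged file can then be supplied at the import step, for example with --extra-stopwords stoplist_multilingual.txt.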

A few questions remained for us: how might removing non-alphanumeric characters (everything outside a-z, A-Z, and 0-9) alter words in languages other than English that rely on accented or special characters? How have others responded to spelling errors? How significant are errors?
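To make the first of those questions concrete, here is a small illustration (the sample strings are invented) of how a strict a-z/0-9 strip treats accented letters and apostrophes as junk, and how a Unicode-aware pattern behaves differently:

```python
import re

samples = ["I've", "humanités numériques", "educación digital"]

# Strict ASCII strip: anything outside a-z, A-Z, 0-9 becomes a space.
ascii_stripped = [re.sub(r"[^a-zA-Z0-9]+", " ", s) for s in samples]
print(ascii_stripped)
# ['I ve', 'humanit s num riques', 'educaci n digital']

# Unicode-aware alternative: keep letters and digits from any script.
# Apostrophes are still lost unless they are preserved explicitly.
unicode_stripped = [re.sub(r"[^\w]+", " ", s) for s in samples]
print(unicode_stripped)
# ['I ve', 'humanités numériques', 'educación digital']
```

That gap between the two patterns is exactly how “I’ve” became “I ve” in our corpus, and it suggests why accented words in the Spanish and French posts may arrive in MALLET already mangled.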

Hopefully a post of this nature will foster discussion and produce a stronger, more complete analysis of this and other documents.

Spring Semester in Research and a THATCamp Challenge

The spring semester is here, and the first-year DH fellows have begun our rotation into the Research division of CHNM.

To get the ball rolling, we spent a week working through the helpful tutorials at the Programming Historian. As someone new to DH, with admittedly limited technical skill and knowledge, I found these immeasurably useful. Each tutorial breaks content into smaller, less intimidating units, which can be completed in succession or selected for a particular topic or skill. While there is useful content for anyone, we focused our attention on Python and Topic Modeling with the aim of solving our own programming dilemma.

Our central challenge was to extract content from across the THATCamp WordPress site so that we could do some text analysis.
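To give a sense of the kind of query this involves, here is a bare-bones sketch using the pymysql library. The connection details, database name, and output folder are placeholders, and a multisite network like THATCamp’s actually spreads posts across per-site tables (wp_2_posts, wp_3_posts, and so on) rather than a single wp_posts table.

```python
import os
import pymysql

os.makedirs("raw_posts", exist_ok=True)

# Placeholder connection details for a WordPress database.
conn = pymysql.connect(host="localhost", user="wp_user",
                       password="secret", database="thatcamp_wp",
                       charset="utf8mb4")

with conn.cursor() as cur:
    # Pull the body text of every published post from a standard
    # wp_posts table; a multisite install keeps per-site tables instead.
    cur.execute(
        "SELECT ID, post_title, post_content FROM wp_posts "
        "WHERE post_status = 'publish' AND post_type = 'post'"
    )
    for post_id, title, content in cur.fetchall():
        # One plain-text file per post, ready for cleaning and MALLET.
        with open(f"raw_posts/{post_id}.txt", "w", encoding="utf-8") as f:
            f.write(f"{title}\n\n{content}")

conn.close()
```

From there the files feed into the cleaning and MALLET steps described above.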


Public Projects: Reflection

The past seven weeks have moved really quickly, but I have benefitted a great deal from the time we spent in the Public Projects section of CHNM.

Due to my relatively limited technical skills, this section has proven to be the most challenging thus far. However, with some help and some pretty detailed instructions, I have been expanding my skill set and feel a lot less intimidated by the tools we work with. We focused on three main projects: testing updates in Omeka, transcribing and revisiting documents at the Papers of the War Department, and contributing to and testing the National Mall site.

I have deeply enjoyed them all, especially the sunny morning we spent at the National Mall. Additionally, a great deal of our work overlapped with the theoretical reading and discussions of our coursework as digital history scholars. It is rare for theory and application to be balanced, but that was definitely my experience this semester. I was frequently surprised to find applications of class reading at work and often referred to the work done at CHNM during course discussions.

Public Projects was deeply inclusive for us as fellows. I got a real sense of each of the ongoing projects and I learned a great deal about the collaborative work required to produce the resources described above.

Overall, this semester the fellowship has given me a structured place to develop my knowledge and expertise with digital tools, like Omeka and Scripto, and given me a sandbox to play with GitHub and the command line (if you know what those things are, you are in a much better place than I was three months ago!).

I’m looking forward to learning more in the semester to come!

Education Dept. Reflection

My time in the education department at CHNM has passed quickly, but it has also been deeply enriching. I’ve learned a lot about the challenges of creating historical scholarship geared toward K-12 students and have come to appreciate the importance of integrating digital media in the classroom. As one can imagine, coming into the Center with limited technical skills can be intimidating, but in these seven weeks the combination of course content and fellowship activities has greatly reduced my concerns.


Introductions

My first introduction to the Center for History and New Media happened without my even realizing it. When I was a graduate student at Gallaudet University, a professor urgently encouraged us to begin using Zotero, and as I rounded the corner on two Master’s theses, the value of this tool was not lost on me. Only after I had begun the process of applying to history programs did I realize that my favorite citation tool had its origins here at George Mason University and CHNM.
