THATCamp MALLET Results

We have spent the last few weeks building a Python script to download and prep all of the THATCamp blog posts for topic modeling in MALLET (for those catching up, we detailed this process in a series of previous posts). As our last post detailed, we encountered a few more complications than expected due to foreign-language content in the corpus. After some discussion, we worked through these issues and added stoplists for German, French, and Spanish to the script. Although this didn’t solve everything (some foreign terms still show up; we didn’t realize there was Dutch in the corpus too), it led to some interesting discussion about the methodology behind topic modeling. Finally, we reran the Python script with the new stopwords and fed the resulting data into MALLET.
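For anyone following along at home, here is a minimal sketch of that final step. It assumes MALLET is unpacked at ~/mallet and the cleaned text files sit in a folder called cleaned_posts; both paths are placeholders rather than our exact setup, and the flags follow the Programming Historian tutorial.

```python
import subprocess

MALLET = "~/mallet/bin/mallet"  # placeholder path to the MALLET executable

# Import the folder of cleaned .txt files into MALLET's binary format.
# --keep-sequence preserves word order, which topic modeling requires.
subprocess.call(
    MALLET + " import-dir --input cleaned_posts"
    " --output thatcamp.mallet --keep-sequence",
    shell=True)

# Train a twenty-topic model; the keys file holds the word lists shown
# below, and the composition file holds per-post topic proportions.
subprocess.call(
    MALLET + " train-topics --input thatcamp.mallet --num-topics 20"
    " --output-topic-keys thatcamp_keys.txt"
    " --output-doc-topics thatcamp_composition.txt",
    shell=True)
```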

MALLET, or MAchine Learning for LanguagE Toolkit, is an open-source Java package for natural language processing; we followed the Programming Historian’s tutorial on it. Topic modeling analyzes a corpus of text and extracts “topics”: sets of words that are statistically related to one another. In our case, we asked MALLET to return twenty topics based on our set of THATCamp blog posts. The topics it returned were:

  1. xa digital art history research university scholarship graduate field center publishing open today institute cultural knowledge professor online world
  2. university games pm humanities http digital september knowledge kansas saturday game conference state registration play information representation workshop boise
  3. thatcamp session sessions day participants free technology page unconference university nwe conference discussion information google propose hope event proposals
  4. people make time questions things idea access process ideas world work great making lot build add kind interesting nthe
  5. digital humanities data tools text projects research scholars omeka texts tool analysis scholarly archive reading online based book scholarship
  6. digital humanities session dh library libraries support projects open discussion librarians amp talk work journal sessions propose faculty list
  7. history public digital historical collections museum media project projects mobile online maps museums collection historians users sites site applications
  8. games zotero thinking place game code end cultural chnm hack year documentation humanists version number pretty application visualization set
  9. session open area data workshop tool knowledge teach interested bay prime bootcamp gis workshops reality night thatcampva virginia lab
  10. work interested students ways teaching post working talk writing blog love issues don conversation create collaborative thinking start discuss
  11. project web content information tools community resources archives experience research create learn creating learning share development materials specific provide
  12. xb xa del se humanidades digitales xad al madrid www mi este aires buenos digital personas taller cuba parte
  13. caption online id align width open attachment accessibility women read university american building accessible gender media november floor race
  14. data http org session www open twitter texas good wikipedia nhttp status wiki start commons drupal metadata people crowd
  15. xa workshop session omeka publishing http gt propose org workshops friday docs open hands amp doc studies topic discuss
  16. students digital learning technology education media college faculty humanities research game pedagogy student courses classroom assignments skills arts social
  17. xa oral digital humanities video event local application community offer interviews planning center education software jewish weekend college histories
  18. een het voor op te zijn deze met workshop kunnen om digitale bronnen data onderzoek historici nl wat worden
  19. social media technology studies arts performance museums xcf play participants cultural performing reading st email object platforms interaction technologies
  20. xa thatcamp org http thatcamps details read movement published planned access nthatcamp browse software follow break series google join

As you can see, we have an impressive list of terms. Before we organize them in a meaningful way, we will briefly point out a common problem that scholars may confront when working with MALLET. As you may notice, quite a few errors such as ‘xa’ appear in the results. While we don’t have a definitive answer for why this happens, we think it has to do with encoding issues that arise when Python moves content out of WordPress posts stored in a MySQL database. Each of these systems uses a different encoding, and the error appears to be related to non-breaking spaces. A little Googling revealed that WordPress stores a non-breaking space as the HTML entity ‘&nbsp;’, while Python represents the underlying Unicode character (U+00A0) with the escape sequence ‘\xa0’. When Python reads WordPress’s ‘&nbsp;’, it understands the space but encodes it as ‘\xa0’. As second-year fellow Spencer Roberts explained, the issue is that meaning is lost in translation. He used this analogy: Python reads and understands the French word for “dog,” then translates it and returns the English word.

In this case, what shows up in our results is not ‘\xa0’ but rather ‘xa’, because we had stripped out all of the non-alphanumeric characters prior to running the data through MALLET. We think errors such as ‘xa’ and ‘xb’ are artifacts of these encoding issues. Anyone interested in clarifying or continuing this discussion with us can do so in the comments.
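To make this concrete, here is a small Python 3 sketch of the chain of events as we understand it; the exact behavior depends on how the intermediate files were written, so treat it as an illustration rather than a diagnosis.

```python
import html
import re

raw = "a&nbsp;session on&nbsp;topic modeling"  # as stored by WordPress

# Unescaping turns the HTML entity into the Unicode non-breaking space,
# U+00A0, which Python displays with the escape sequence '\xa0'.
decoded = html.unescape(raw)
print(repr(decoded))  # 'a\xa0session on\xa0topic modeling'

# If that escape sequence leaks into a text file as literal characters
# (e.g., via repr), stripping non-alphanumerics strands an 'xa':
leaked = repr(decoded)
print(re.sub(r"[^a-zA-Z ]+", " ", leaked))  # ' a xa session on xa topic modeling '

# The safer fix: replace the character itself before any stripping.
clean = decoded.replace("\xa0", " ")
```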

Returning to our MALLET results, our next challenge was to present and analyze the large amount of data. Drawing on the work of both Cameron Blevins and Robert K. Nelson, we decided to group the topics by theme so that trends could be more easily identified. We determined that there were roughly seven broad themes in the corpus of THATCamp blog posts from 2008 to the present:

  1. Accessibility
  2. Building
  3. Community
  4. International
  5. Pedagogy
  6. Public Digital Humanities
  7. THATCamp Structure

Using these larger categories, we created several charts that show how the THATCamps have changed over time. The charts are available below; note that we have graphed percentages, which represent how frequently each topic occurred within the posts from a given camp.
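For those curious about the mechanics, here is a sketch of how such percentages can be computed from MALLET’s composition file. It assumes MALLET 2.0.7’s doc-topics format (doc number, filename, then alternating topic/proportion pairs) and filenames that encode the camp, e.g. thatcamp-prime_123.txt; both are assumptions about the data layout, not a description of our exact script.

```python
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(float))  # camp -> topic -> sum
counts = defaultdict(int)                         # camp -> number of posts

with open("thatcamp_composition.txt") as f:
    for line in f:
        if line.startswith("#"):  # skip the header row
            continue
        fields = line.split()
        # e.g. '.../thatcamp-prime_123.txt' -> 'thatcamp-prime'
        camp = fields[1].split("/")[-1].rsplit("_", 1)[0]
        counts[camp] += 1
        pairs = fields[2:]  # alternating topic number, proportion
        for topic, proportion in zip(pairs[::2], pairs[1::2]):
            totals[camp][int(topic)] += float(proportion)

# Average proportion of each topic at each camp, as a percentage.
percentages = {camp: {t: 100 * p / counts[camp] for t, p in topics.items()}
               for camp, topics in totals.items()}
```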

[Chart: Topics Overall]

[Chart: Topics relating to Accessibility]

[Chart: Topics relating to Community]

[Chart: Topics relating to THATCamp Structure]

[Chart: Topics relating to Pedagogy]

[Chart: Topics relating to Public Digital Humanities]

[Chart: Topics relating to Building]

[Chart: Topics relating to the international influence of THATCamp]

We found these results particularly interesting. A broad conclusion is that THATCamp content emphasizes the various applications of digital technology to scholarship, from public uses to tool building to teaching, and that the THATCamp community has grown more varied since its founding. Close examination of the topic models, however, reveals that a number of the same terms appear across many of them (“digital”, for instance, appears in 8 of the 20 topics). This recurrence reflects how ideas circulate among camps and unify the community, and it points to the subjects at the heart of the community’s work.

If you’re interested in the data, you can view the various files here:

Unexpected Challenges Result in Important and Informative Discussions: a transparent discussion about stripping content and stopwords

As described in previous posts, the first-year Digital Fellows at CHNM have been working on a project under the Research division that involves collecting, cleaning, and analyzing data from a corpus of THATCamp content. Having overcome the hurdles of writing a Python script and using MySQL to grab content from tables in the backend of a WordPress install, we moved on to the relatively straightforward process of running our stripped text files through MALLET.

As we opened the MALLET output files, excited to see the topic models they contained, we were confronted with a problem we hadn’t anticipated, and it turned into a rather important discussion about data and meaning.

As a bit of background: topic modeling involves filtering “stopwords” out of a data set. A stoplist typically includes function words, terms like “a”, “an”, and “the” that appear constantly in discourse; these are filtered out because they serve a grammatical purpose but carry little lexical meaning. Errors, misspellings, and lines of code that slipped through the previous steps can also be filtered out at this stage.
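The filtering itself is the simple part; here is a toy example with a handful of English function words (real stoplists, such as the en.txt file that ships with MALLET, run to hundreds of entries):

```python
# A toy stoplist; real lists are far longer than this.
stopwords = {"a", "an", "the", "of", "and", "to", "in", "that"}

tokens = "a session on the analysis of the digital humanities".split()
print([t for t in tokens if t not in stopwords])
# ['session', 'on', 'analysis', 'digital', 'humanities']
```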

As we opened the file of keys produced by MALLET, we found terms that raised questions about what should or should not be included in our analysis. In particular, the discussion centered on spelling errors and on function words in Spanish and French.

The conversation that followed, reproduced below, was significant. As people look through the results of this project or consider their own efforts to reproduce something like it elsewhere, we’d like to be transparent about the decisions we made and, perhaps, spur a discussion about how to address scenarios like this in the future.

_____________________________________________

Take a look at what MALLET spit out: there are some errors.
Stuff like “xe, xc, zijn, xb, en, la”.

Yeah, I saw that.
We can make a custom stoplist to remove those.
Make a list and we’ll add it.

Interesting keys though – did you see that “women” came up in #0?
I’m excited to see this once it’s all graphed.

There’s some stuff though that I’m not sure we want to remove: Is CAA an error or an abbreviation? “socal” – is that social misspelled or Southern California, abbreviated?
Hmm, that could be an organization…

Yeah, SoCal, as in Southern California.
There was a camp there.
This raises a larger question: do we remove misspellings?
For clarity?

I have mixed feelings.

Me too.

I think it’s appropriate to remove backend stuff:
tags and metadata. But content is not something we should modify.

I agree.
We don’t want to skew the results.
Some of it occurred when we stripped all the non-alphanumeric stuff out.
It took out apostrophes, causing words like “I’ve” to become “I ve”.

The errors in themselves are telling about the nature of THATCamps.
That the content is generated spontaneously
lends itself to deviations from appropriate spelling, etc.

I agree

Look at #17 and 18.

Whoa, where did “humanidades” come from?
Oh, right! There are international conference posts in here too!

How do we handle this?
It’s possible to strip out the camps that are not in English,
or even to run analysis on them separately.
I don’t want to skew the results but this also throws things off.

I know.
I say we leave it. It shows a growing international influence.
We’ll be able to see the emergence of International THATCamps.

I’ve never run into this before.
It brings up some interesting issues -
I wonder if there is standard procedure for something like this.

What about things like “en”, which is Spanish for “in”?
Its English equivalent would have been caught by our stoplist.
And now function words in Spanish and French
seem to appear more frequently because
the English terms have been filtered out.
How do you do topic modeling with multiple languages?

What about special characters?
We’ve stripped stuff out; how would that affect the appearance of words?

We have to find a stoplist for each of the languages.
To strip out the function words of all of them.

Good call. This got complicated quick!

Agreed.

_____________________________________________

As outlined above, opening the text file of keys raised new questions about the relevance and complications of running a particular stoplist on a corpus of texts. Similarly, we were forced to rethink how we handle misspellings and unfamiliar abbreviations. In the end, we tracked down stoplists for Spanish (and French) so that function words in other languages would not skew the results of our analysis. We also carefully examined the keys to identify abbreviations and misspellings, and decided to keep them, since they make a significant contribution to the analysis.
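For anyone facing the same problem, NLTK conveniently ships stoplists for all of the languages we ran into; we assembled ours by hand from lists found online, but the effect is the same. A sketch:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stoplist corpus

# Merge the stoplists for every language present in the corpus.
multilingual = set()
for language in ["english", "french", "german", "spanish", "dutch"]:
    multilingual.update(stopwords.words(language))

tokens = ["las", "humanidades", "de", "het", "digitale", "history"]
print([t for t in tokens if t not in multilingual])
# ['humanidades', 'digitale', 'history']
```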

A few questions remained for us: how might removing every character outside a–z, A–Z, and 0–9 alter the meaning of words that use special characters in languages other than English? How have others responded to spelling errors? How significant are errors?
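The first question, at least, is easy to demonstrate: restricting a text to ASCII alphanumerics splits any word containing an accented character.

```python
import re

text = "humanités numériques y humanidades digitales en español"
# Everything outside a-z, A-Z, 0-9 is replaced with a space, so accented
# words break into fragments that MALLET then counts as separate tokens.
print(re.sub(r"[^a-zA-Z0-9]+", " ", text))
# humanit s num riques y humanidades digitales en espa ol
```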

Hopefully a post of this nature will foster discussion and produce a stronger, more complete analysis of this and other collections of documents.

Pre-processing Text for MALLET

In our previous post, we described the process of writing a Python script that pulled from the THATCamp MySQL database. In this post, we continue the project by cleaning up the data we’ve collected and preparing it for analysis, a process known as “pre-processing”. After running our script on the THATCamp database, all of the posts were collected and saved as text files. At this stage, the files are full of extraneous information relating to the structure of the posts, mostly tags and metadata that would disrupt any attempt to look across the dataset. Our task was to clean them up so they could be fed into MALLET. To do this, we needed to strip the HTML tags, remove punctuation, and remove common stopwords. We used chunks of code from the Programming Historian’s lesson on text analysis with Python, modified to work with the files we had already downloaded.
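In outline, the cleaning step looked something like the sketch below. The function and file names are illustrative rather than our exact script, and the tag-stripping regex is a simplification of the tutorial’s character-by-character stripTags routine.

```python
import re

def strip_tags(text):
    """Remove HTML tags, keeping the text between them."""
    return re.sub(r"<[^>]+>", " ", text)

def preprocess(text, stoplist):
    """Lowercase, drop tags and punctuation, and filter out stopwords."""
    text = strip_tags(text).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)  # punctuation and other symbols
    return [word for word in text.split() if word not in stoplist]

# Stoplist and post filenames are placeholders.
with open("stopwords_en.txt") as f:
    stoplist = set(f.read().split())

with open("post.txt") as f:
    print(" ".join(preprocess(f.read(), stoplist)))
```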


Extracting Data from the THATCamp Database Using Python and MySQL

This week we’ve continued work on a Python script that will extract all of the blog posts from the various THATCamp websites. As Jannelle described last week, our goal was to write a script that downloads the blog posts in plain text and strips all of the HTML tags, stopwords, and punctuation so that we can feed them into MALLET for topic modeling and text analysis. After several long days and a lot of help from second-year fellow Spencer Roberts, we’ve successfully gotten the code to work.
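The core of such a script might look roughly like this; we use the MySQLdb module as an illustration, the credentials and table name are placeholders, and since THATCamp runs as a WordPress multisite the real extraction would loop over the posts table for each camp’s subsite.

```python
import MySQLdb

# Placeholder credentials, not the real THATCamp database settings.
db = MySQLdb.connect(host="localhost", user="user",
                     passwd="secret", db="thatcamp")
cursor = db.cursor()

# WordPress keeps posts in a *_posts table; published blog posts only.
cursor.execute(
    "SELECT ID, post_content FROM wp_posts "
    "WHERE post_status = 'publish' AND post_type = 'post'")

# Save each post to its own plain-text file for the cleaning step.
for post_id, content in cursor.fetchall():
    with open("posts/%d.txt" % post_id, "w") as f:
        f.write(content)

db.close()
```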


Spring Semester in Research and a THATCamp Challenge

The spring semester is here and the first year DH fellows have begun our rotation into the Research division of CHNM.

To get the ball rolling, we spent a week working through the helpful tutorials at the Programming Historian. As someone new to DH, with admittedly limited technical skill and knowledge, I found these immeasurably useful. Each tutorial breaks content into smaller, less intimidating units, which can be completed in succession or selected for a particular topic or skill. While there is useful content for anyone, we focused our attention on Python and topic modeling with the aim of solving our own programming dilemma.

Our central challenge was to extract content across the THATCamp WordPress site to enable us to do some text analysis.


Public Projects: Reflection

Our first semester at the Center for History and New Media has flown by. We spent the second half of the semester in the Public Projects Division which was a diverse and rewarding experience.

During this rotation we toured the entire division and spent time working on many of its projects. We spent a large chunk of time working with Omeka, testing plugins, themes, and other items in development. One thing I took away from working with the Omeka team and attending the sprint planning meetings is how collaborative this division, and the center as a whole, is. Between programmers, designers, testers, and content developers, Omeka really is a team project that seeks to make collecting easier for museums and archives. Through working with the software we also got hands-on experience with the amount of work it takes to build an archive and the kinds of issues that come up when doing so, such as naming pages and areas on a site, creating a strict vocabulary to make searching consistent, and developing content firsthand.

We also spent time developing content for projects such as the Histories of the National Mall and the Papers of the War Department. The National Mall project allowed us to think about how the public uses mobile history sites at a museum or a national park such as the Mall. We spent a wonderful afternoon down on the Mall testing the mobile-first site (and enjoyed some excellent tacos from a local food truck!).

The Papers of the War Department was a different experience: we spent time both transcribing documents and tagging metadata. Using the Scripto plugin for Omeka, we revisited documents and tagged them with keywords, names, places, and topics. This element of the project required a deeper engagement with the documents than transcribing did. Transcribing the documents was challenging (eighteenth-century handwriting is interesting), but we could all see the immense benefit of having the documents both transcribed and tagged on the site.

I think we are really beginning to understand the inner workings of the center and the projects and goals of each division. Public Projects does many different things, from software development to content-based projects, and I think we all benefited greatly from our tour of the division. Coincidentally, the first-year fellows were also taking Clio Wired I this semester, and what we did at the center often overlapped with what we did in class, making the experience even more valuable. I think we all came away from this semester having learned a great deal, and I feel much more aware of the issues facing scholars in digital history centers as well as in academia in general.

Reflections: Year Two, Semester One

As the first term of 2013-14 closes, it seems appropriate to reflect on the experiences of the Digital History Fellows. Last year, our first cohort of DH Fellows spent the first semester meeting with Dan Cohen, learning the history of the center, discussing current projects, and thinking about how digital history is practiced. We spent our second semester working in each of the divisions for five weeks, and then decided in which division we would like to work in the second year. Although there was no specific requirement that we take positions spread across the three divisions, we were drawn in different directions. From the first days of the fellowship, Ben Hurwitz was most comfortable in Education and quickly entrenched himself at their community table. He now works on various educational projects, including the Popular Romance Project. Amanda Morton worked closely with Fred Gibbs before he relocated to New Mexico, which helped her transition into Research, where she works on Digital Humanities Now and related PressForward projects. Spencer Roberts was drifting toward Public Projects before the summer started, and settled in once the center received a grant to work with the National Park Service to revamp their War of 1812 site.

This year we welcomed three new members into the fellowship, bringing our total number to six. The second cohort follows a different schedule in their first year, so Amanda Regan, Anne Ladyem McDivitt, and Jannelle Legg stepped directly into the mix at RRCHNM, splitting their semester into seven-week blocks in Education and Public Projects. During those weeks, they have written reflective posts about the projects to which they’ve contributed, all of which can be found here. Next term, they will spend a block in Research before moving into a final seminar with Stephen Robertson.


Public Projects: Reflection

The past seven weeks have moved really quickly, but I have benefited a great deal from the time we spent in the Public Projects section of CHNM.

Due to my relatively limited technical skills, this section has proven the most challenging thus far. However, with some help and some pretty detailed instructions, I have been expanding my skill set and feel a lot less intimidated by the tools we work with. We focused on three main projects: testing updates in Omeka, transcribing and revisiting documents at the Papers of the War Department, and contributing to and testing the National Mall site.

I have deeply enjoyed them all, especially the sunny morning we spent at the National Mall. Additionally, a great deal of our work overlapped with the theoretical reading and discussions of our coursework as digital history scholars. It is rare for theory and application to be balanced, but that was definitely my experience this semester. I was frequently surprised to find applications of class reading at work and often referred to the work done at CHNM during course discussions.

Public Projects was deeply inclusive for us as fellows. I got a real sense of each of the ongoing projects and I learned a great deal about the collaborative work required to produce the resources described above.

Overall, this semester the fellowship has given me a structured place to develop my knowledge and expertise with digital tools, like Omeka and Scripto, and a sandbox to play with GitHub and the command line (if you know what those things are, you are in a much better place than I was three months ago!).

I’m looking forward to learning more in the semester to come!

Reflections on Public Projects

This week, we finished our rotation block with Public Projects. I both struggled with and thoroughly enjoyed the work: I learned many new and helpful things while also discovering my weaknesses in some of the more technical aspects of digital history. This block included many different types of projects, such as live-testing a new website at the National Mall, writing entries for that project, testing Omeka, and even transcribing letters for the Papers of the War Department.

I also got to venture into DC for the first time for work during this rotation, which I enjoyed immensely. I was very thankful that I got to test the new National Mall project with my other first year fellows, and you can read more about that experience here. I am excited to see it go live, and I hope that when it is live, many other first-time and returning visitors to the Mall can utilize it.

I also had some difficulties during the block that I overcame, which makes me feel incredibly accomplished. Although I felt comfortable with Omeka coming into this block, I have learned much more about how it functions and its different uses than I had previously known. I also learned that transcribing handwritten letters and pulling keywords out of them are entirely different experiences. This was difficult, especially figuring out what particular words were, but it was so useful, connecting, and interesting to read these letters from when the US was a brand-new country.

I loved working within this block, and I liked being challenged by many of the projects. I have learned useful skills that I can apply to my future career and dissertation as a historian. Coming into George Mason University, I already had my MA in Public History, and I have a real passion for making history accessible to the public. A lot of the work being done in the Public Projects section of CHNM applies this concept, and I take great inspiration from the people and projects I have encountered while working here.

A Bit of Reflection on Pressforward Projects

It’s interesting to be on the other side of the production of something like DHNow/JDH. Not only does sorting through material for each offer a unique opportunity to explore current events and conversations in the digital humanities, but the process also encourages deeper examination of blog posts and white papers to pull out threads of argument and evidence that can connect disparate conversations across fields. Archaeologists and manuscript historians share common interests with those working in the hard sciences and linguistics, although their work is rarely presented in the same forum. Part of what JDH adds to the DH community is this willingness to collect and edit work from across several disciplines and present it as part of a united DH culture.

I’ve learned, as a graduate student working on these projects, that being part of this collecting and collating work requires a willingness to explore a wide range of interests, and to read blog posts, white papers, and poster projects that have little to do with my own projects or areas of expertise. For example, most of the content for JDH comes from the pool chosen for Editors’ Choice features on DHNow, a selection process that requires the Editors-in-Chief for a given week to read through content nominated by a group of editors-at-large whose experience in the DH community varies. The job of the EC is to sort through these nominations, pulling out relevant job postings, conference and event announcements, calls for participation, and useful resources, then picking one or two items to feature as the Editors’ Choice for the Tuesday and Thursday of that week.

The selection of these Editors’ Choice items is left largely up to the EC for the week. There are guidelines, of course. Featured items need to be of substantial length, usually more than 500 words or 20 minutes of video or presentation playback, and should make a relevant, substantive, and perhaps even provocative argument that adds to or initiates a conversation in the field. Since DHNow only links to these posts (there’s no editing involved), they should also be well written and, if necessary, thoroughly cited. White papers and articles are generally posted only if they haven’t been published in other journals or periodicals.

While these guidelines are helpful, on good weeks the Editors-at-Large nominate several pieces that meet the requirements, leaving the final selection up to the EC for the week. Each of us has our own idiosyncrasies, of course, and our own areas of interest can influence our choices. We also take into account how many times an item has been nominated, paying attention to that additional level of interest as well as to comments (in the PressForward plugin) that explain why our guest editors nominated individual items. What results is a crowd-sourced, yet still curated, publication that feeds into JDH.

Recent changes to the DHNow site, in both the sections dedicated to the Editors-at-Large and the main content pages, will hopefully encourage our guest editors to engage more in the content selection process. It will be interesting to see whether new editors (and returning participants) start to leave more comments and feedback, giving us a better understanding of how they select content to nominate. The other reason behind the redesign, beyond helping current editors, was to pull in more outside editors. The more participants we have, the more feeds are nominated for the plugin, and the more exposure both we and our editors have to the ongoing conversations and arguments circulating within the DH community. By encouraging a more engaged community, we are also pushing for more interdisciplinary participation in the field, bringing scientists, librarians, archaeologists, archivists, historians, and others into a community whose makeup should result in bigger and better projects and, perhaps, a more solid sense of a DH identity.