Pre-processing Text for MALLET

In our previous post, we described writing a Python script that pulled posts from the THATCamp MySQL database. In this post, we continue with that project and work to clean up the data we’ve collected and prepare it for analysis, a process known as “pre-processing”. After we ran our script against the THATCamp database, all of the posts were collected and saved as text files. At this stage, the files are filled with extraneous information relating to the structure of the posts, mostly tags and metadata that would disrupt any attempt to look across the dataset. Our task was to clean the files up so they could be fed into MALLET. To do this, we needed to strip the HTML tags, remove punctuation, and remove common stopwords. We took chunks of code from the Programming Historian’s lesson on text analysis with Python and modified them to work with the files we had already downloaded.
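The three cleaning steps named above (strip HTML tags, remove punctuation, remove stopwords) can be sketched with nothing but the Python standard library. This is a minimal illustration, not the code from the Programming Historian lesson or from our script: the function names `strip_tags` and `preprocess` and the tiny `STOPWORDS` set are placeholders made up for the example, and a real run would use a much longer stopword list.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption for this sketch);
# a real list would contain a few hundred words.
STOPWORDS = {"a", "an", "and", "at", "the", "of", "to", "in", "is", "it", "was", "we"}


class _TextExtractor(HTMLParser):
    """Collects only the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def preprocess(html, stopwords=STOPWORDS):
    """Strip tags, lowercase, split on non-word characters, drop stopwords."""
    text = strip_tags(html).lower()
    # Splitting on runs of non-word characters removes punctuation
    # and whitespace in a single step.
    words = re.split(r"\W+", text)
    return [w for w in words if w and w not in stopwords]


if __name__ == "__main__":
    sample = "<p>We met at the <b>THATCamp</b> session, and it was great!</p>"
    print(" ".join(preprocess(sample)))
```

Joining the surviving words back into one line of plain text per post gives files in a shape that can then be imported into MALLET.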


Extracting Data from the THATCamp Database Using Python and MySQL

This week we’ve continued to work on building a Python script that will extract all of the blog posts from the various THATCamp websites. As Jannelle described last week, our goal was to write a script that downloads the blog posts in plain text form and strips all of the HTML tags, stopwords, and punctuation so that we can feed them into MALLET for topic modeling and text analysis. After several long days and a lot of help from second-year fellow Spencer Roberts, we’ve successfully gotten the code to work.
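The extraction step can be sketched roughly as follows. This is not our actual script: the function name `export_posts` is hypothetical, and the table and column names (`wp_posts`, `ID`, `post_content`, `post_status`, `post_type`) assume a standard WordPress schema, which may not match how the THATCamp database is laid out. The function accepts any Python DB-API connection, so the same code would work with whichever MySQL driver is installed.

```python
from pathlib import Path


def export_posts(conn, out_dir, table="wp_posts"):
    """Write each published post's content to its own text file.

    conn  -- an open DB-API connection (for example from a MySQL driver)
    table -- posts table name; assumed WordPress default, pass a different
             name if the site uses a prefixed table. Do not pass untrusted
             input here, since it is interpolated into the query.
    Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    cur = conn.cursor()
    cur.execute(
        f"SELECT ID, post_content FROM {table} "
        "WHERE post_status = 'publish' AND post_type = 'post'"
    )

    count = 0
    for post_id, content in cur.fetchall():
        # One file per post, named after the post's database ID.
        (out / f"post_{post_id}.txt").write_text(content, encoding="utf-8")
        count += 1
    return count
```

Saving one file per post keeps the later cleaning and MALLET import steps simple, since each file can be processed on its own.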


Public Projects: Reflection

Our first semester at the Center for History and New Media has flown by. We spent the second half of the semester in the Public Projects Division, which was a diverse and rewarding experience.

During this rotation we were able to tour the entire division and spend some time working with many of its projects. We spent a large chunk of time working with Omeka, testing plugins, themes, and other items that are in development. One thing I took away from working with the Omeka team and attending the Sprint Planning meetings is how collaborative this division, and the center as a whole, is. Between programmers, designers, testers, and content developers, Omeka really is a team project that seeks to make collecting easier for museums and archives. Through working with the software we also got some hands-on experience with the amount of work it takes to build an archive and the kinds of issues that come up when doing so. We discussed and experienced firsthand issues such as naming pages and areas on a site, creating a strict vocabulary to make searching consistent, and developing content.

We also spent time developing content for projects such as The Histories of the National Mall and Papers of the War Department. The National Mall project allowed us to think about how the public uses mobile history sites when at a museum or a national park such as the Mall. We spent a wonderful afternoon down on the Mall testing the mobile-first site (and enjoyed some excellent tacos from a local food truck!).

Papers of the War Department was a different experience, and we spent time both transcribing documents and tagging metadata for documents. Using the Scripto plugin for Omeka, we first tagged revisit documents with key words, names, places, and topics. This element of the project required some background knowledge and a deeper engagement with the documents than transcribing did. Transcribing the documents was challenging (eighteenth-century handwriting is interesting), but we could all see the immense benefit of having the documents both transcribed and tagged on the site.

I think we are beginning to understand the inner workings of the center and the projects and goals of each division. Public Projects does several different things, from software development to content-based projects, and I think we all benefited greatly from our tour around the division. Coincidentally, the first-year fellows were also taking Clio Wired I this semester, and what we did at the center often overlapped with what we did in class, making the experience even more valuable for us. I think we all came away from this semester having learned a great deal, and I feel much more aware of many of the issues facing scholars in Digital History centers as well as in academia in general.

Reflections on the Education Division

On Monday the first-year fellows leave the Education Division and move to Public Projects for the remainder of our first semester. Over the last seven weeks, I have learned a lot about the projects and tools within the Education Division and about the division’s goal of providing teachers with the skills and tools to teach historical thinking to students. I’ve come away from this rotation not only with a better understanding of the role of the Education Division but also with a new appreciation for the challenge of using and creating tools that encourage students to think critically about history.


Introduction to CHNM–Amanda Regan

Prior to arriving at George Mason University, I had some experience with Digital History and as a result was familiar with the Roy Rosenzweig Center for History and New Media (RRCHNM). I earned my master’s degree at California State University San Marcos, where I took several digital history courses. It was in these courses that I first became familiar with RRCHNM and the digital history projects that it had created. Looking at the center from the outside, it was hard to get a grasp on exactly how it operated and what kinds of things went on in the center on a daily basis.
