We have spent the last few weeks building a Python script to download and prepare all of the THATCamp blog posts for topic modeling in MALLET (for those catching up, we detailed this process in a series of previous posts). As our last post detailed, we encountered more complications than expected because the corpus includes posts in languages other than English. After some discussion, we worked through these issues and added stoplists to the script for German, French, and Spanish. This didn’t solve all of our issues, and some non-English terms still show up (we didn’t realize there was Dutch too), but it led to some interesting discussion about the methodology behind topic modeling. Finally, we reran the Python script with the new stopwords and fed the new data into MALLET.
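The stoplist step described above can be sketched roughly as follows. This is a minimal illustration rather than the script we actually ran: the function names (`parse_stoplist`, `merge_stoplists`, `remove_stopwords`) are hypothetical, and the word lists are tiny placeholders standing in for full per-language stoplists.

```python
def parse_stoplist(text):
    """Parse a stoplist given as one word per line, ignoring blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def merge_stoplists(*stoplists):
    """Combine any number of per-language stoplists into a single set."""
    merged = set()
    for stoplist in stoplists:
        merged |= stoplist
    return merged


def remove_stopwords(words, stopwords):
    """Keep only the words that do not appear in the stoplist (case-insensitive)."""
    return [word for word in words if word.lower() not in stopwords]


# Placeholder lists for illustration only; real stoplists are much longer.
ENGLISH = parse_stoplist("the\nand\nof")
GERMAN = parse_stoplist("der\ndie\nund")
FRENCH = parse_stoplist("le\nla\net")
SPANISH = parse_stoplist("el\nlos\ny")

ALL_STOPWORDS = merge_stoplists(ENGLISH, GERMAN, FRENCH, SPANISH)

words = ["Die", "Geschichte", "und", "the", "archive", "het"]
cleaned = remove_stopwords(words, ALL_STOPWORDS)
```

In this sketch the Dutch article "het" survives because no Dutch list was merged in, which mirrors the leftover terms we noticed in our own output.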
As described in previous posts, the first-year Digital Fellows at CHNM have been working on a project under the Research division that involves collecting, cleaning, and analyzing data from a corpus of THATCamp content. Having overcome the hurdles of writing a Python script and using MySQL to grab content from tables in the backend of a WordPress install, we moved on to the relatively straightforward process of running our stripped text files through MALLET.
As we opened the MALLET output files, excited to see the topic models it produced, we were confronted with a problem we had not anticipated, and this turned into a rather important discussion about data and meaning.
In our previous post, we described the process of writing a Python script that pulled from the THATCamp MySQL database. In this post, we continue with this project by cleaning up the data we’ve collected and preparing it for analysis, a process known as “pre-processing”. After we ran our script against the THATCamp database, all of the posts were collected and saved as text files. At this stage, the files are filled with extraneous information relating to the structure of the posts, mostly tags and metadata that would disrupt any attempt to look across the dataset. Our task here was to clean the files up so they could be fed into MALLET. To do this, we needed to strip the HTML tags, remove punctuation, and remove common stopwords. We took chunks of code from the Programming Historian’s lesson on text analysis with Python and modified them to work with the files we had already downloaded.
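The three cleaning steps above can be sketched as follows. This is a minimal sketch using only the Python standard library, not the Programming Historian code we adapted; the function names and the very short `STOPWORDS` set are assumptions made for illustration.

```python
import string
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stoplist; a real run would load a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class _TextExtractor(HTMLParser):
    """Collect the text that appears between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def remove_punctuation(text):
    """Replace ASCII punctuation with spaces (curly quotes are not covered)."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def preprocess(html, stopwords=STOPWORDS):
    """Strip tags, drop punctuation, lowercase, and filter out stopwords."""
    text = remove_punctuation(strip_tags(html)).lower()
    return [word for word in text.split() if word not in stopwords]


tokens = preprocess("<p>The history of <b>THATCamp</b>, in brief!</p>")
```

Joining the resulting tokens with spaces and writing them to one text file per post gives plain-text input of the kind MALLET can import.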
This week we’ve continued to work on building a Python script that will extract all of the blog posts from the various THATCamp websites. As Jannelle described last week, our goal was to write a script that downloads the blog posts in plain text form and strips all of the HTML tags, stopwords, and punctuation so that we can feed them into MALLET for topic modeling and text analysis. After several long days and a lot of help from second-year fellow Spencer Roberts, we’ve successfully gotten the code to work.
The spring semester is here and the first-year DH fellows have begun our rotation into the Research division of CHNM.
To get the ball rolling, we spent a week working through the helpful tutorials at the Programming Historian. As someone new to DH, with admittedly limited technical skill and knowledge, these were immeasurably useful. Each tutorial breaks content into smaller, less intimidating units. These can be completed in succession or selected for a particular topic or skill. While there is useful content for anyone, we focused our attention on Python and Topic Modeling with the aim of solving our own programming dilemma.
Our central challenge was to extract content across the THATCamp WordPress site to enable us to do some text analysis.
This week, we finished our rotation block with Public Projects. I both struggled with and thoroughly enjoyed working in Public Projects: I learned so many new and helpful things, and I also found my weaknesses in some of the more technical aspects of digital history. This block included many different types of projects, such as live testing a new website at the National Mall, writing entries for that project, testing Omeka, and even transcribing letters for Papers of the War Department.
I also got to venture into DC for the first time for work during this rotation, which I enjoyed immensely. I was very thankful that I got to test the new National Mall project with my other first year fellows, and you can read more about that experience here. I am excited to see it go live, and I hope that when it is live, many other first-time and returning visitors to the Mall can utilize it.
I also had some difficulties in the block that I overcame, which makes me feel incredibly accomplished. Although I felt comfortable with Omeka coming into this block, I have learned so much more about how it functions and the different uses than I had previously known. I also learned a lot about how transcribing and pulling out keywords from handwritten letters are entirely different experiences. This was difficult, especially figuring out what particular words were, but it was so useful, connecting, and interesting to read these letters from when the US was a brand new country.
I loved working within this block, and I liked that I was challenged by a lot of the projects we worked on. I have learned a lot of useful skills that I can apply to my future career or dissertation as a historian. Coming into George Mason University, I already had my MA in Public History, and I have a real passion for making history accessible to the public. I believe that a lot of the work that is being done in the Public Projects section of CHNM is applying this concept, and I take great inspiration from the people and projects that I have encountered while working here.
My time in the Education block of the Center for History and New Media as a Digital History Fellow has been quite interesting for me. Previously, my experience with teaching was limited to either working as a Graduate Teaching Assistant for introductory-level history courses or teaching fourth graders as a Public History Educator at a museum in Sanford, Florida. Given my admittedly limited experience with K-12 education, this rotation has revealed a lot about how technology can support teaching history to students at those levels.
Although historians always analyze information and primary documents, it is a lot more difficult to determine the best way for students to utilize those resources for learning. For example, while writing reviews for Teaching History, I had to consider the typical things for historians, such as bias, type of information, and quality and quantity of the primary documents. What is new to me is that I also had to think of how these items could potentially enhance a lesson plan for a teacher for their class. In addition, I also had to consider the usability of these websites and tools. If a website is too difficult or confusing for a student to use, then it is problematic to consider it a valuable teaching resource, even if the information is good.
I have previously mentioned the challenges of thinking as an educator, and these challenges continue to be something that I must tackle as I continue in the educational portion of CHNM, as well as in my future as a historian. I believe that these are some of the valuable lessons that I can take from working as a Digital History Fellow at CHNM, because I will be able to utilize the skills that I have obtained from working on these projects in future endeavors.
On Monday the first-year fellows leave the Education Division and move to Public Projects for the remainder of our first semester. Over the last seven weeks, I have learned a lot about the projects and tools in the education division and about the division’s goal of providing teachers with the skills and tools to teach historical thinking to students. I’ve come away from this rotation not only with a better understanding of the role of the education division but also with a new appreciation for the challenge of using and creating tools that encourage students to think critically about history.
My time in the education department at CHNM has passed quickly, but it has also been deeply enriching. I’ve learned a lot about the challenges of creating historical scholarship geared toward K-12 students and have come to appreciate the importance of integrating digital media in the classroom. As one can imagine, coming into the Center with limited technical skills can be intimidating, but in these seven weeks the combination of course content and fellowship activities has greatly reduced my concerns.
For the past few weeks at the Center for History and New Media, my fellow first-year Digital History Fellows and I were assigned to work in the Education division, which produces projects designed to teach history to a wide range of people through various educational resources. While in the Education division, we have been working with a new web project meant to engage and educate its audience by allowing them to examine liberty in the United States in a new and interesting way. This is achieved by incorporating age- and ability-appropriate “challenges” and access to primary documents and images. This project seeks an audience of teachers, K-12 students, and the general public.
Creating a challenge for students involves some intriguing methods. While creating our own challenge for the project, there were multiple questions that we had to ask ourselves. First, what was the goal of the project? What did we want the students to achieve from doing the challenge? What skills would they use? In examining the sources, we tried to approach them analytically, but with basic guidance so that the students do not get overwhelmed. We wanted the students to come away with an appreciation of the importance of understanding not only the documents themselves, but also their context. Giving the students a choice of which documents to use for their own project allows them to view our examples and apply the skills they gained to create an interesting project from their own understanding.
Although this project has yet to launch publicly, I have been testing the website from multiple angles to ensure that it will work properly for the end users. This has certainly been a fun process for me, as I have had to work as both a teacher and a student! This meant that I had to get myself into a mindset of, “if I were in tenth grade, how would I have completed this assignment? What did I know? What did I not know?” It was also quite engaging to use the primary documents and photographs in conjunction with the provided tools to create interesting projects with the website. I would imagine that K-12 students would also find this quite exciting, and I think that it would be a fun experience for teachers who are designing challenges for their students as well. I know all of the DH Fellows who worked on this project took our assignments very seriously beyond just the testing phase, as we worked for hours to perfect our challenge assignments!
Originally posted on Center for History and New Media Blog