System Development Update
We have recently completed most of the functionality of the SoCQA tool. This includes ingesting documents annotated in Atlas-ti, learning a model from those annotations and reporting its performance, applying the model to additional data, and allowing the user to verify whether the model's predictions are correct.
For our pilot project, which studies leadership among open source developers, the tool can ingest email documents in their HTML format; it can ingest the Atlas-ti exported XML representation of the coders' annotations, along with the corresponding codebook of codes; and it can reconcile the two text representations into annotated text, which we represent in the LAF (Linguistic Annotation Framework) standard. We have implemented a User Interface (UI) for the social science researcher to manage the annotated documents and the presumably much larger set of unannotated documents that they wish to analyze.
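LAF is a stand-off design: annotations live separately from the text and point into it by character offsets. A minimal sketch of that idea in Python (the class and field names are our own illustrations, not the tool's actual schema):

```python
from dataclasses import dataclass

# Stand-off annotation in the LAF style: the code is stored apart from the
# text and refers into it by character offsets (names here are illustrative).
@dataclass
class Annotation:
    code: str    # a code from the Atlas-ti codebook
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the end of the span

def annotated_spans(text, annotations):
    """Resolve stand-off annotations back to the text they label."""
    return [(a.code, text[a.start:a.end]) for a in annotations]

doc = "Alice proposed the patch. Bob merged it."
anns = [Annotation("leadership", 0, 25)]
spans = annotated_spans(doc, anns)
```

Because the annotations never touch the text itself, the same document can carry coder annotations and machine-predicted annotations side by side.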
In the UI, the researcher can ask the system to learn a model from a selected group of annotated documents and can then see the resulting precision and recall for each code, along with a confusion matrix showing which codes are predicted correctly and which are mistaken for one another. The researcher can then apply this model to unannotated documents.
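This learn-and-report step can be illustrated with a small scikit-learn pipeline. SoCQA's actual learner and features are not described here, so the vectorizer, classifier, and toy sentences below are stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

# Toy annotated sentences; the codes are hypothetical codebook entries.
sentences = ["I will take charge of the release",
             "Can someone review this patch?",
             "Let's assign tasks for the sprint",
             "This test fails on my machine"]
codes = ["leadership", "request", "leadership", "report"]

# Learn a model from the annotated sentences.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sentences, codes)

# Report per-code precision/recall and a confusion matrix.
predicted = model.predict(sentences)
print(classification_report(codes, predicted, zero_division=0))
print(confusion_matrix(codes, predicted, labels=sorted(set(codes))))
```

In practice the report would be computed on held-out annotated documents rather than the training sentences themselves.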
Our most recent milestone is to complete the verification part of the UI. Applying the model to unannotated documents results in a Machine Learned (ML) annotation consisting of the sentences of the documents labeled by the predicted codes of the ML model. The verification UI allows the researcher to verify these predictions. In our design, the researcher reads through the sentences predicted for a particular code and clicks one of four options: "Yes" (the prediction is correct), "No" (the prediction is wrong), "Don't know" (allowing the researcher to punt), or "Don't code" (the text is extraneous junk and should not have any codes applied to it). The researcher can also click on an individual sentence, and the entire document pops up in a window so that they can read the sentence in context if that is needed to decide whether the code is correct.
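The four verification choices can be captured in a simple record per verified sentence; the names below are our own illustration, not the tool's internal schema:

```python
from dataclasses import dataclass
from enum import Enum

# The four choices the verification UI offers the researcher.
class Verdict(Enum):
    YES = "yes"              # the prediction is correct
    NO = "no"                # the prediction is wrong
    DONT_KNOW = "dont_know"  # the researcher punts
    DONT_CODE = "dont_code"  # extraneous text; no code should apply

@dataclass
class Verification:
    doc_id: str
    sentence: str
    predicted_code: str
    verdict: Verdict

v = Verification("email-042", "I'll handle the merge.", "leadership", Verdict.YES)
```

Keeping "Don't code" distinct from "No" lets the tool later exclude junk text from training entirely rather than treating it as a negative example of one code.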
The design of the UI reflects the second of our two strategies for responding to the results of the model on the annotated data. The first strategy applies when the model performs poorly on a particular code because of a small number of examples; in our pilot project, for instance, there were codes with fewer than 5 examples in the annotated text. In this case, we use active learning: we ask the researcher to provide more annotated examples of that code on their own.
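Active learning typically chooses which unlabeled examples to send back to the human by picking those the model is least sure about. A generic uncertainty-sampling sketch, not the tool's actual selection code:

```python
import numpy as np

def least_confident(probabilities, k=5):
    """Uncertainty sampling: return the indices of the k examples whose
    top predicted-class probability is lowest, i.e. the ones the model
    is least confident about, to hand to the researcher for annotation."""
    confidence = probabilities.max(axis=1)  # top-class probability per example
    return np.argsort(confidence)[:k]       # indices, least confident first

# Per-class probabilities for three unlabeled sentences:
probs = np.array([[0.90, 0.10],    # confident
                  [0.55, 0.45],    # very uncertain
                  [0.60, 0.40]])   # somewhat uncertain
picked = least_confident(probs, k=2)
```

Annotating these borderline cases tends to improve the model faster than annotating sentences it already classifies confidently.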
The second strategy applies when the model achieves reasonable performance. In this case, we tune the model for high recall, which means that we try not to miss any possible predicted codes. The penalty for high recall is low precision: the model is over-optimistic and labels far too many sentences as having a predicted code.
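One common way to tune for high recall is to lower the decision threshold on a code's predicted probability, so that more sentences are flagged for human verification; the threshold value below is purely illustrative:

```python
import numpy as np

def predict_high_recall(probabilities, threshold=0.2):
    """Flag a sentence as carrying the code whenever its predicted
    probability clears a low threshold, trading precision for recall."""
    return probabilities >= threshold

# Predicted probability of one code for four sentences:
probs = np.array([0.85, 0.30, 0.22, 0.05])

# At a conventional 0.5 cutoff only the first sentence would be labeled;
# at 0.2, three sentences are flagged and sent to the researcher to verify.
flags = predict_high_recall(probs)
```

The many false positives this produces are exactly what the verification UI is designed to let the researcher dismiss quickly.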
Our initial experience is that this strategy is workable for the researcher. Even when the researcher had to read through 100 sentences and found only 3-4 that were correctly predicted, this was much faster than continuing annotation on their own, reading through documents in their entirety.
Going forward, we will continue to experiment with this strategy. The results of the human verification will be fed back into the model as what we are calling "silver" data. Like "gold" data, it contains annotations judged correct by humans; unlike gold data, however, it does not cover completely annotated documents, since the researcher has not read through every document but has only verified the sentences that carried predictions. As we continue, we will track the model's accuracy and the researcher time required to achieve it.
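A hypothetical sketch of how silver data might be folded back into the training set, keeping only the Yes-verified predictions; the tuple layout and function name here are our own assumptions, not the tool's implementation:

```python
def build_training_set(gold, silver):
    """Combine fully annotated 'gold' (sentence, code) pairs with 'silver'
    pairs, i.e. model predictions the researcher verified. Only predictions
    the researcher confirmed with "yes" are kept for retraining."""
    verified = [(sent, code) for sent, code, verdict in silver
                if verdict == "yes"]
    return gold + verified

gold = [("Let's assign tasks.", "leadership")]
silver = [("I'll lead the release.", "leadership", "yes"),
          ("This test fails.", "leadership", "no")]
training = build_training_set(gold, silver)
```

Because silver documents are only partially verified, sentences the researcher never saw contribute no labels at all, rather than being treated as negative examples.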