Email2wiki
From Knowledge Discovery
Abstract
In the past 20 years, email has risen to become one of the primary methods of communication between individual members of project teams. However, email is still an imperfect communication medium within a team environment. While an email is largely a private communication between two entities, the topics it addresses are often pertinent to the entire team. Current attempts to solve this problem have revolved around either manually entering the important questions and answers onto another platform or using a much more structured and restrictive piece of project management software.
The goal of this project is to refine the process of transferring important information from a private email conversation to an editable wiki visible to the entire project team. The eventual software system will eliminate emails that are merely procedural (meeting scheduling, etc.) and those that are confidential (i.e. not meant for team use) and insert the rest into a series of structured wiki pages, organized by the main topic threads that run through the email correspondence.
Technical Approach
One advantage of this project is that it lends itself to easy modular separation. Each module can then be individually coded and tested in a true object-oriented fashion. The language of choice in this project will be Python, with a small amount of SQL possibly needed for the MySQL wiki database work.
The first module will consist of Python code that will take an email sent through Mailman and begin to parse out important information regarding the email’s state: its sender, recipients, timestamp, subject and body will all be cataloged for use in later stages of the email2wiki process.
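As a rough illustration of this cataloging step, the sketch below uses Python's standard email package to pull the five fields out of one raw message. The function name catalog_email and the dictionary keys are assumptions made for this example, and the hand-off from Mailman is not shown.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def catalog_email(raw_message):
    """Extract sender, recipients, timestamp, subject and body from raw text."""
    msg = message_from_string(raw_message)

    # Collect every address that appears in the To and Cc headers.
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr for _name, addr in getaddresses(recipient_headers)]

    # Use the first text/plain part as the body. A single-part message
    # is yielded by walk() as its own only part.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
            break

    # Leave the timestamp as None when the Date header is missing.
    date_header = msg.get("Date")
    timestamp = parsedate_to_datetime(date_header) if date_header else None

    return {
        "sender": parseaddr(msg.get("From", ""))[1],
        "recipients": recipients,
        "timestamp": timestamp,
        "subject": msg.get("Subject", ""),
        "body": body,
    }
```

Returning a plain dictionary keeps this module independent of the later stages, which only need to read the cataloged fields.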
In contrast to the above module, the second module, entity resolution and tagging, will be a substantial undertaking. Entity resolution is complex due to the intrinsic ambiguity found in natural languages. In order to decide what an entity is (a person, place, program, etc.), this module must take into consideration the context in which the entity is located. The context consists of words surrounding the entity as well as other words present in the text. A statistical analysis will need to be performed in order to assess the probability that a particular entity is of a specific type. This analysis, combined with information gleaned from the identity of the sender and recipient, should allow for either one or multiple entities to be identified within a single email message.
After entity resolution and tagging, the next module to be constructed is one that splits the body of an email into separate segments if necessary. Since emails can consist of multiple topics to be archived, it is important to have a mechanism to separate the text in a logical way and allow each main topic to be entered into the wiki database separately. Assuming the entity recognition module works correctly, this piece of Python code will be relatively simple to construct, and can use language constructs (paragraphs, sentences, etc.) to split the message body into segments based upon the entities they contain.
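The proposal does not fix a particular statistical method. As one possible illustration, the sketch below estimates the probability of each entity type from the surrounding context words using a naive Bayes model with add-one smoothing. The class name ContextTypeClassifier, the type labels, and the training examples are all hypothetical, and sender and recipient information is not modeled here.

```python
import math
from collections import Counter, defaultdict


class ContextTypeClassifier:
    """Naive Bayes over context words; an illustrative stand-in only."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # entity type -> word counts
        self.type_counts = Counter()             # entity type -> example count
        self.vocabulary = set()

    def train(self, entity_type, context_words):
        """Record one labeled example: the words seen around an entity."""
        words = [w.lower() for w in context_words]
        self.type_counts[entity_type] += 1
        self.word_counts[entity_type].update(words)
        self.vocabulary.update(words)

    def probabilities(self, context_words):
        """Return the estimated P(type | context) for every known type."""
        if not self.type_counts:
            return {}
        words = [w.lower() for w in context_words]
        total_examples = sum(self.type_counts.values())
        vocab_size = len(self.vocabulary) or 1
        log_scores = {}
        for entity_type, count in self.type_counts.items():
            total_words = sum(self.word_counts[entity_type].values())
            score = math.log(count / total_examples)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing a type.
                seen = self.word_counts[entity_type][word]
                score += math.log((seen + 1) / (total_words + vocab_size))
            log_scores[entity_type] = score
        # Normalise the log scores into probabilities that sum to 1.
        peak = max(log_scores.values())
        exp_scores = {t: math.exp(s - peak) for t, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {t: v / norm for t, v in exp_scores.items()}
```

With this approach, an entity surrounded by words such as "installed" and "version" would score higher as a program than as a person, provided the training examples reflect that usage.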
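A minimal sketch of paragraph-level segmentation follows. It assumes that the entity names are already available from the entity recognition module and that paragraphs are separated by blank lines; the function name segment_by_entity is hypothetical, and plain substring matching stands in for real entity tags.

```python
import re


def segment_by_entity(body, entities):
    """Group the paragraphs of `body` under each entity they mention."""
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]

    segments = {}
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        for entity in entities:
            # Case-insensitive substring match, a simplification of tagging.
            if entity.lower() in lowered:
                segments.setdefault(entity, []).append(paragraph)

    # Join each entity's paragraphs back into one segment of text.
    return {entity: "\n\n".join(parts) for entity, parts in segments.items()}
```

A paragraph that mentions two entities is placed in both segments, so each topic receives its full context; paragraphs that mention no known entity are dropped in this sketch.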
One easy application of tagging entities is that wikilinks can be made between various pages. These links will allow users to navigate to new pages without these pages having tightly crafted titles.
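The linking step could look roughly like the following. The double-bracket link syntax is an assumption (it is the convention used by MediaWiki and several other wiki engines; the proposal does not name an engine), and add_wikilinks is a hypothetical helper that links only the first mention of each entity.

```python
import re


def add_wikilinks(text, entities):
    """Wrap the first mention of each entity in [[...]] link markup."""
    # Longest names first so "build server" is preferred over "server".
    names = sorted(entities, key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )

    linked = set()

    def replace(match):
        key = match.group(0).lower()
        if key in linked:
            # Later mentions of an already linked entity stay as plain text.
            return match.group(0)
        linked.add(key)
        return "[[" + match.group(0) + "]]"

    # A single pass over the text avoids creating links inside links.
    return pattern.sub(replace, text)
```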
Once each main thread has been separated out of the message body, the next module can search the existing wiki pages for one that matches the topic referenced in the message segment and, if none exists, create a page that will eventually be filled with the given topic. This code will search the wiki pages stored in a MySQL database and thus might require a combination of SQL and Python to complete.
Once the page that will hold the message segment has been identified, the final module will summarize the segment and place it into the wiki page, perhaps by modifying an existing Perl script. To summarize data, code will be developed that builds upon previous versions of summarization programs. This portion will require more research and an understanding of how to effectively summarize messages contained specifically within email. After this summarization is complete, the new text fragment will be inserted into the identified wiki page and opened to editing by any team member.
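The lookup-or-create logic might be sketched as below. To keep the example self-contained it uses Python's built-in sqlite3 module in place of MySQL, and the pages table with its id, title and body columns is a hypothetical schema, not the schema of any real wiki engine.

```python
import sqlite3


def open_demo_database():
    """Create an in-memory database with a hypothetical pages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        "id INTEGER PRIMARY KEY, title TEXT UNIQUE, body TEXT)"
    )
    return conn


def find_or_create_page(conn, topic):
    """Return (page_id, created) for the page whose title matches `topic`."""
    # Case-insensitive title match; a real system may need fuzzier matching.
    row = conn.execute(
        "SELECT id FROM pages WHERE title = ? COLLATE NOCASE", (topic,)
    ).fetchone()
    if row is not None:
        return row[0], False

    # No matching page exists, so create an empty one to be filled later.
    cursor = conn.execute(
        "INSERT INTO pages (title, body) VALUES (?, '')", (topic,)
    )
    conn.commit()
    return cursor.lastrowid, True
```

Passing the topic as a query parameter rather than formatting it into the SQL string avoids injection problems when topic names come from email text.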
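Since the summarization method is left open for further research, the following is only a placeholder technique: a simple frequency-based extractive summarizer that keeps the sentences whose words occur most often in the segment. The function name summarize, the stopword list, and the sentence-splitting rule are assumptions made for this sketch.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stopword list for the example.
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "for", "is", "are",
    "was", "we", "will", "it", "this", "that", "on", "with", "by", "i", "you",
}


def content_words(text):
    """Lowercase the text and keep only words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        # Average word frequency, so long sentences are not favored unduly.
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        return sum(frequency[w] for w in tokens) / len(tokens)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Email-specific concerns such as quoted replies, signatures and greetings are not handled here and would need the additional research mentioned above.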
The software will be released as open source under the same dual licence as MySQL.
