CoAuthor

Architecture Diagram: GIF

Project Description

In practice we have implemented various techniques that approximate the methodology described below. One such technique is the determination of text dependency information. (There happens to be a very useful Perl module called Math::PartialOrdering, for dealing with partial orderings.)

We used a weighting similar to TF-IDF to extract terms that are shared mostly between two documents and few others, i.e. $shared->{$t} = 4 * ($doc->{$a}->{$t} * $doc->{$b}->{$t}) / ($tf->{$t} * $tf->{$t}). We hypothesized that if a text A is a prerequisite of a text B, the first occurrences of shared terms would be spread more evenly through A, whereas in B they would sit closer to the front. Our reasoning was that a text which defines the terms introduces them gradually, whereas a text which presumes knowledge of them does not hesitate to use them early. So our prerequisite measure is simply the difference between the average percent position of first occurrences of shared terms in A and in B. This intentionally simplistic measure seems to have given at least initially usable results.
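
The weighting and the prerequisite measure described above can be sketched in a few lines. The following is a minimal Python illustration, not the project's Perl code: the function names, the simple whitespace tokenizer, the 0.5 weight threshold, and the reading of $tf->{$t} as the term's total count across the whole collection are all assumptions made for the example. Under that reading, the factor of 4 scales the weight so that it reaches 1 when a term occurs equally often in the two documents and nowhere else.

```python
from collections import Counter


def tokenize(text):
    """Lowercased whitespace tokens with edge punctuation stripped (a stand-in
    for real term extraction)."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]


def shared_weights(docs, a, b):
    """Weight each term shared by documents a and b.

    Follows the formula in the text: 4 * count_a * count_b / tf**2, where tf is
    ASSUMED to be the term's total count over every document in `docs`.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    tf = Counter()
    for c in counts.values():
        tf.update(c)
    return {t: 4 * counts[a][t] * counts[b][t] / (tf[t] ** 2)
            for t in counts[a] if t in counts[b]}


def first_positions(tokens):
    """Map each term to the relative position (0..1) of its first occurrence."""
    positions = {}
    for i, t in enumerate(tokens):
        if t not in positions:
            positions[t] = i / len(tokens)
    return positions


def prerequisite_score(docs, a, b, threshold=0.5):
    """Mean first-occurrence position of shared terms in a, minus that in b.

    Only terms whose shared weight reaches `threshold` (a hypothetical cutoff)
    are counted. A positive value suggests a introduces the terms gradually
    while b uses them early, i.e. a may be a prerequisite of b.
    """
    terms = [t for t, w in shared_weights(docs, a, b).items() if w >= threshold]
    if not terms:
        return 0.0
    pos_a = first_positions(tokenize(docs[a]))
    pos_b = first_positions(tokenize(docs[b]))
    return sum(pos_a[t] - pos_b[t] for t in terms) / len(terms)
```

A positive score is only weak evidence that the first document is a prerequisite of the second. The score is antisymmetric, so swapping the two documents flips its sign.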

This capability will be used by the clear ITS to provide constraints on the reading of large groups of documents, i.e. digilib. The quiz capability will be used to determine familiarity with given material, both for placement and for advancement. Consideration will be given to the goal-based justification for reading texts. So, one general function of coauthor is to write custom texts for individual learners, based on clear placement/familiarity quizzes and interaction history, over the large text collections contained in digilib. Even more applications are possible, such as generating appropriate documentation (overviews, manuals, etc.) from disparate project sources, bootstrapping knowledge formation, intelligence analysis, or organizing various information sources.

The idea of a system that automatically synthesizes book formats is nothing new. We have many choices for how to proceed. We could formalize all writings axiomatically and then use NLG to write the book, following some method for introducing concepts. This is close to what we have done; however, we do not have tools for completely representing these concepts. Initially we have chosen to perform a shallower parse of the materials, because the structures required for a shallow parse are prerequisites of those required for deeper parses.

Initially we focus on the organization of the source materials. Since we would like the process to be initiated entirely by the computer, with the possibility of asking the author a complexity-bounded set of questions, we ensure that the process is entirely automated. Automated essay scoring techniques (LSI) are applied to determine the important sections of the text. The corpus of texts to be integrated is sorted and clustered hierarchically. Sections and chapters are best fitted, and titles are extracted using a hybrid keyphrase/title recognizer and summarization.
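
As an illustration of the hierarchical clustering step, here is a small Python sketch that merges the most similar documents first and returns the resulting tree as nested tuples. It is a simplification under stated assumptions: it compares raw term-count vectors by cosine similarity rather than using LSI, it uses average-link merging, and the function names are hypothetical rather than part of coauthor.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster_hierarchically(docs):
    """Average-link agglomerative clustering of docs (name -> text).

    Returns a nested tuple tree such as ("c", ("a", "b")). Raw term counts are
    an ASSUMED stand-in for the LSI representation mentioned in the text.
    """
    if not docs:
        raise ValueError("need at least one document")
    vectors = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # Each cluster is a pair: (tree built so far, list of member names).
    clusters = [(name, [name]) for name in sorted(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i][1] for b in clusters[j][1]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

The nesting depth of the returned tree could then serve as a first guess at chapter and section boundaries, although fitting real chapters would need more than this.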

A corpus of related materials is downloaded and weighted according to importance.

We ensure that all terminology is defined before being used, and this may involve restructuring the texts. This is done by terminology extraction followed by word sense disambiguation. The author selects senses from among WordNet senses and automatically extracted definitions, using Ted Pedersen's WSD Perl modules. In this way we develop a large dictionary of concepts. We use various network centrality measures to compute the most important concepts. We then present them according to a format which has been obtained by performing the exact same analysis on related texts and mapping to their formats as represented in DocBook.
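
The define-before-use constraint can be pictured as a topological sort over sections. The sketch below is a hypothetical Python illustration, not coauthor's implementation: it assumes each section has already been reduced to a set of terms it defines and a set of terms it uses, and it relies on the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def order_sections(sections):
    """Order sections so every term is defined before it is used.

    `sections` maps a section name to {"defines": set, "uses": set}; this
    record layout is an assumption made for the example. Terms that no section
    defines are ignored.
    """
    # Remember the first section that defines each term.
    defined_in = {}
    for name, sec in sections.items():
        for term in sec["defines"]:
            defined_in.setdefault(term, name)

    # graph[name] holds the sections that must come before `name`.
    graph = {name: set() for name in sections}
    for name, sec in sections.items():
        for term in sec["uses"]:
            owner = defined_in.get(term)
            if owner is not None and owner != name:
                graph[name].add(owner)

    return list(TopologicalSorter(graph).static_order())
```

If two sections each use a term the other defines, static_order raises graphlib.CycleError, which is one signal that the texts need the restructuring mentioned above.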

Some degree of reasoning must be applied in order to sort out conflicting information from these sources.

Spelling and grammar checking is employed at this stage.

We then employ a sentence-level statistical rewrite of the text to enhance readability. We use the HALogen NLG system with a language model trained on books and papers in the domain and weighted according to the importance of the source text. We then use NLU to map the text to the interlingua and regenerate it.

Readability analysis as well as Latent Semantic Indexing is applied at this stage.
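
The page does not say which readability measure is used. As one possibility, the Python sketch below computes the Flesch Reading Ease score, 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). The choice of this measure, the function names, and the vowel-group syllable counter (a crude heuristic for English) are all assumptions.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Comparing the score of a passage before and after the sentence-level rewrite would give a rough check on whether the rewrite actually made the text easier to read.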

coauthor should be of assistance in writing any kind of documentation (FAQs/Howtos/Use Cases/Man pages/Papers/etc.). It needs an interactive authoring environment that suggests completions, even operating at the rhetorical structure level. It could also integrate with reasonbase.

Capabilities

  • Well we got up to busroute and poked up to coauthor in our audit of useful stuff. Need to come up with more stringent classification criteria.
  • Create a publications section of fweb, from coauthor.
  • coauthor should beautify and declassify and generate these paper drafts.
  • I mean that coauthor could use.
  • Put statistical language modelling and correction software into coauthor.
  • coauthor can generate zooming descriptions of various text systems.
  • coauthor can help us write Man pages, or any kind of documentation for that matter.
  • Could use coauthor to generate cover letters for the resume.
  • For any text which coauthor uses to generate a book that is not by a particular author - coauthor can do the citation.
  • Need to get a centrifuser thing going for news texts, and add it to coauthor.
  • coauthor's dependency system can be used with clear in order to build mental models. Clearly, coauthor and clear are very closely related, which is very funny since I didn't even detect that at first. We need to get the coauthor system's ability to generate dependencies between written materials working so that (1) I can generate a doctrinal hierarchy for people to test proficiency in.
  • Use SPADE with coauthor.
  • Add stuff to coauthor to automatically copy out the current book, to label it with the title, and put it in a collection, etc.
  • Only publish ideas that you've written before we started on coauthor.


This page is part of the FWeb package.
It derives from the Robotics Institute projects page.
Last updated Mon Jan 15 08:34:29 CST 2007.