Corpus
corpus will serve as the automatic classification system for
unilang, which is necessary to achieve the desired capability
of automatic message routing.
The concept of Adjustable
Autonomy is relevant here.
corpus now has a reasonable UI and is successfully
classifying messages with reasonable accuracy.
We are using rainbow, a Bayesian text classifier.
It gives surprisingly good results considering how little
information would appear to be present in the sentences.
However, it is not sufficient.
While it usually chooses the
correct category, the error rate is still too high, and
disambiguating some of the weaker classes will require extra
information.
Therefore, I am looking to incorporate other
sources of classification evidence, based on features recognized
by other external codebases.
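
As a rough illustration of the kind of Bayesian ranking described above, here is a minimal multinomial naive Bayes sketch. It is not rainbow and makes no claim about how rainbow works internally; the `NaiveBayes` class, the tokenizer, and the four toy training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,!?()\"'-"

def tokenize(text):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing (hypothetical helper)."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per class
        self.word_counts = defaultdict(Counter)   # per-class word frequencies
        self.vocab = set()

    def train(self, text, label):
        self.class_docs[label] += 1
        for w in tokenize(text):
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def rank(self, text):
        """Return [(label, probability), ...] from most to least likely."""
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab)
        tokens = tokenize(text)
        log_scores = {}
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)          # class prior
            for w in tokens:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + v))
            log_scores[label] = score
        # Normalize in log space so the probabilities sum to 1.
        m = max(log_scores.values())
        z = sum(math.exp(s - m) for s in log_scores.values())
        ranked = [(l, math.exp(s - m) / z) for l, s in log_scores.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Toy training data, invented for this sketch.
clf = NaiveBayes()
clf.train("need to pick up pay check", "observation")
clf.train("forgot my keys at home", "observation")
clf.train("add a feature to list items by class", "capability-request")
clf.train("corpus needs to allow renaming of classes", "capability-request")

ranking = clf.rank("Forgot to pick up pay check.")
```

Additional evidence from other codebases could, under this sketch, be folded in as extra features or as a second score combined with the ranking; how corpus will actually do this is not specified here.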
Other features that will be added are as follows.
The ability to vet the automatic classifications.
A type system will be created.
Recipient agents can reject messages, which will
help with classification.
Mass verification and classification adjustment,
followed by message reclassification.
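
A minimal sketch of how vetting, rejection, and reclassification feedback might fit together. Every name here (`Classification`, `vet`, `reject`, `training_examples`, the `source` flag) is hypothetical and only illustrates the idea; it is not the corpus implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Classification:
    message: str
    label: str
    source: str = "auto"      # "auto" or "manual" (assumed flag)
    vetted: bool = False
    rejected_by: list = field(default_factory=list)  # history of rejecting agents

def vet(item, correct_label=None):
    """Confirm the automatic label, or replace it with a manual one."""
    if correct_label is not None and correct_label != item.label:
        item.label = correct_label
        item.source = "manual"
    item.vetted = True
    return item

def reject(item, agent):
    """A recipient agent refuses the message, so the label needs review again."""
    item.rejected_by.append(agent)
    item.vetted = False
    return item

def training_examples(items):
    """Only vetted items are fed back to the classifier for reclassification."""
    return [(i.message, i.label) for i in items if i.vetted]
```

Under these assumptions, a rejection simply reopens the item for review, and only vetted items are used when the classifier is retrained.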
The next paragraph shows a very preliminary classification
example and the current scheme (ranked by the probability
associated with the example message).
Note that the classification
is exactly correct.
The scheme system will be greatly revamped to
allow a subsumption hierarchy, and will also focus more on
what the actual routing commands are.
So, for instance, rather
than "goal", we would have "(agent: verber) (new-goal $1)", or
rather than just "icodebase-capability-request", have "(agent:
myfrdcsa) (capability-request verber $1)",
i.e. the
responsible agent and the corresponding command to be sent.
(((Forgot to pick up pay check - need to go pick that ASAP.)))
observation 0.441955
verber-task-definition 0.244548
complex-statement 0.118441
0) Finished
* 1) observation
* 2) verber-task-definition
3) complex-statement
4) icodebase-solution-to-extant-problem
5) icodebase-capability-request
6) event
7) icodebase-input-data
8) dream
9) solution-to-extant-problem
10) system-request
11) policy
12) priority-shift
13) quote
14) unclassifiable
15) intersystem-relation
16) SOP
17) funny-annecdote
18) unilang-client-outgoing-message
19) goal
20) icodebase-task
21) suspicion
22) not-a-unilang-client-entry
23) dangling-clause
24) capability-request
25) rant
26) icodebase-resource
27) propaganda
28) inspiring-annecdote
29) shopping-list-item
>
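
The planned move from bare class names to "responsible agent plus command" could be sketched as a lookup table. The two routes below are taken from the description above; the `ROUTES` table, the `route` function, and the choice to quote the message when filling in `$1` are assumptions made for illustration.

```python
# Class name -> (responsible agent, command template). "$1" stands for the message.
ROUTES = {
    "goal": ("verber", "(new-goal $1)"),
    "icodebase-capability-request": ("myfrdcsa", "(capability-request verber $1)"),
}

def route(label, message):
    """Return (agent, command) for a class, or None if no route is defined."""
    entry = ROUTES.get(label)
    if entry is None:
        return None  # no known command; a caller might fall back to asking the user
    agent, template = entry
    # Quoting the message is an assumption of this sketch.
    quoted = '"' + message.replace('"', '\\"') + '"'
    return agent, template.replace("$1", quoted)
```

For example, `route("goal", "finish corpus")` gives `("verber", '(new-goal "finish corpus")')`, while a class without a route, such as `observation`, returns `None`.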
- /var/lib/myfrdcsa/codebases/internal/unilang/start corpus ELog unilang-Client
- Relate http://research.microsoft.com/research/nlp/msr_paraphrase.htm to corpus matching.
- Can create a new perllib::TimeSeries::Segment using corpus::TDT
- Refactor ASConsole::corpus::DBMail to use ASConsole::corpus::IMAP as a base class.
- Setup centralized documentation corpus so that we can illustrate what documents we need to connect.
- /var/lib/myfrdcsa/codebases/internal/unilang/start audience broker clear corpus cso ELog manager OpenCyc pse unilang-Client
- Convert the GigaWord corpus into a server (if it even runs, I may need more memory... or just load a smaller portion of it.)
- Create our own annotated fallacy corpus.
- Prototype various nlp applications wrt corpus.
- Maybe have a template based pattern extractor for corpus events.
- Should complete a basic schedule system manually, using corpus to look things up.
- /var/lib/myfrdcsa/codebases/internal/unilang/start audience broker clear corpus cso ELog OpenCyc pse unilang-Client
- Just use the language generation feature to create a translation corpus
- Add duplicate detection to corpus feedback for unilang.
- Here is a good idea - use existing bug databases found on the internet as a labelled training corpus for the corpus classifier.
- Should use WSD on corpus entries and also run ExtractAbbrev.java on them to get abbreviations like Emacs unilang agent, run this through termios, and use that to bootstrap translation.
Also do anaphora resolution by running it on chunks of aligned texts.
- Could classify corpus entries along emotional lines.
- Let us consider the corpus problem to be multistrategy and complex, and therefore in need of being worked on.
- corpus should know the greater context of statements and thus have access to writings and estimates of when they were made.
- corpus can use a similarity weighting based on similarity and use that as evidence of related classifications.
- Get a handle on this corpus problem pretty soon.
Come up with some quantification for it.
- corpus obviously needs to take advantage of Weka's text mining capabilities!
- corpus can analyze temporal semantics of things like "tomorrow" and use that in relation to the assertion date.
- Delete all the incorrect classifications in the sql log when corpus classifier is working.
- Should have the corpus classifier allow multiple schemes for things like classified, non-classified (since if you are only thinking about 1 scheme you might forget these.)
- It is recording the messages sent back from corpus to unilang-Client, not necessarily relevant.
- Now all we have to do is fix that data and worry about complexity/efficiency issues with corpus doing autoclassification.
We also have to add more elegant classification than bayes, as well as talking amongst agents to support better categorization and command execution from our notes.
- Great - corpus is working well, if a bit slow.
- Humorous fact: I have written so many systems now that there are name conflicts between them, for instance "Cyc-Mode" and "corpus manager" are both abbreviated cm.
- (Here is the todo for unilang/corpus/pse)
- Obviously corpus needs to be ported to SQL.
- Fex for corpus.../
- Code monkey ought to have a corpus of examples for learning of error messages, etc, all marked by their full environmental context.
- Calculate various statistics on the resume corpus, for instance, how many skills are listed on average, etc.
- corpus can determine when there is not enough information available for a given classifier to classify an Item, in which case it does various checks, defaulting to asking the user.
- TimBl for corpus?
- Slipper for corpus?
- Could use record linkage detection techniques for corpus as well as for Sorcerer
- Need to use corpus to classify email/aim logs for what to do with them.
- This thing I've made for Sorcerer sure looks useful, it is being used in Sorcerer, broker, should be used in job-search, and could be used in busroute, clear, corpus, critic, cso, digilib, and verber.
- corpus needs to have some significant natural language understanding abilities.
Also, might consider using chill or equivalent to generate Cyc representations.
- When corpus classifies texts, it can use the same classifiers, duh.
- I think that if you think about it, well, what we really want out of corpus is just the most specific command it feels comfortable with issuing.
I guess we run into the same problem I was facing somewhere else (maybe here), yeah it was here, about dangerous commands, etc. So, whereas we have only focused on classification (because that _seemed_ logical), now we probably want to model the command domain a little more carefully, going so far into it as the exact grammars.
Ultimately, I think what we're seeing is there has to be code to do the final evaluation.
- Obviously the classifications from corpus depend on the current state.
- Definitely need to come up with a protocol section, to ensure that things happen as we want.
Therefore, interface with verber there, for temporal planning, and also use Kissinger corpus.
- I want to get this corpus done.
However, without gourmet, how can I eat to finish these things?
- corpus can have modules for some of the more complex systems.
- Add a feature to corpus or whatever, that is able to index various "requirements" like files, and then destroy them.
- corpus should be sure to do things linearly.
That way, when manually classifying things, we can assert a "continuation" between entries if they represent the same topic.
- Use sentence splitting with corpus.
- using kissinger corpus, formalize domains they are discussing, and represent communication actions formally, classify them, etc.
- I realized that the same system that is holding up both corpus and RSR is also holding up gourmet.
An ontology editor of sorts.
- Check whether the corpus of typing lessons for trr is large.
- This would be as a means to disambiguate items.
For instance - "create capabilities management system" - is this done?
Well, when corpus is done, it will be, but how does pse know that necessarily?
- One possible thing to do is write something for unilang that interfaces with the corpus classifier.
- corpus should probably use an ILP tool to learn more rules for better classification
- Now, we need to finish corpus and actually have planning working.
- We can record a corpus of bus questions, etc, from pedestrians for use in determining additional busroute functionalities
- A quick corpus thing would be a command like corpus --listall task
- The overview of the corpus algorithm is too slow and inaccurate.
- Areas for improvement with corpus.
It's quite clear.
The procedure I came up with isn't efficient but it is easily adaptable to an efficient procedure.
Two things - don't do n^2, rather only look at reasonably related documents - this will cut down dramatically on run time.
Second, make the sharedterm code more efficient, and lastly profile to see where all the time is being spent.
- Use Math::PartialOrdering for subsumption hierarchies for corpus and possibly gourmet
- corpus might well take into consideration that misspellings tend to occur just before the user hits return, since often they don't check.
- Should configure kbfs to start commenting on files.
For instance - konik_laird_ilp2004.pdf is related to corpus.
- Only after building OCR corpus.
- However now I'm too tired to do this.
So I need to do that and other related things, like build a better critic::Classifier type system for corpus, tomorrow.
But I must also eat tomorrow.
- I believe I should get something preliminary going for pse pretty soon - as exported from corpus, for now.
So that we can begin to get an agenda in place.
From this agenda, and from the interest mapping system, we can start getting activities going.
- In lieu of a complete solution, we can simply manually verify corpus auto classification results at the end of each cycle.
- Features to add to corpus - need to add system to determine whether a property was manually or automatically selected.
- If it's not obvious, unilang will be using corpus to classify the user's entries.
- Wow, I got corpus working, sort of.
That's good news.
It is now classifying unilang log messages, and doing a rather good job.
- Need corpus to list items by class.
- corpus needs to also store its classes.
- corpus needs a feature where it can automatically handle complex statements, as well as automatic classification, and lastly, a measure of when something is done being classified.
It should also allow reclassification using inherent distinctions present after the addition of new classes.
- corpus needs to allow renaming of classes.
- corpus needs to be easier to use (i.e. not have that windowing problem.)
- Maybe some explanation of the distinction between classes in corpus would be useful.
- corpus must first chunk, then classify.
- Okay, FEX can be used with corpus.
- Maybe corpus could handle formalization of everything - from verber and pse entries to?
- Come up with a set of targets for corpus.
For instance -
- Other important thinking, I cannot begin work on certain projects, for instance, the meetup client, until corpus is finished.
We can represent this as an HTN in opencyc.
- Also need unilang or corpus to record which messages have already been addressed.
- It would be nice if corpus classified various types of messages, and came up with a standard way of saying them, then a LM could be created.
- corpus can concatenate two entries in a row and see how much sense it makes as a test when a connection is suspected.
- You can use typing speed to (help) classify related thoughts in corpus.
- I can't wait to get working on mapping corpus events to actions that require verber actions.
It would also be nice to move on to the area of integration with Cyc, supposing I can ever figure out how to work log files.
- Of course, make sure to use Conversation perl module in corpus analysis of unilang.xml
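
One of the notes above suggests avoiding the n^2 comparison by only looking at reasonably related documents. A minimal sketch of that idea, assuming "related" means "shares at least a few terms", uses an inverted index to generate candidate pairs. The function name `candidate_pairs` and the `min_shared` threshold are hypothetical, and this is not the existing sharedterm code.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(docs, min_shared=2):
    """docs maps id -> text. Return {(a, b): shared_term_count} for every
    pair of documents sharing at least min_shared distinct terms."""
    # Inverted index: term -> ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)

    # Count shared terms only for documents that co-occur in a posting list.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}

docs = {
    1: "pick up pay check",
    2: "pay check arrived today",
    3: "bus route schedule",
}
pairs = candidate_pairs(docs)
```

Documents with no terms in common are never compared at all. Very frequent terms still produce long posting lists, so a stopword filter or a cap on posting-list length would probably be needed in practice; profiling would show whether that matters.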
This page is part of the FWeb package.
It derives from the
Robotics Institute projects page.
Last updated Mon Jan 15 08:34:38 CST 2007.