Revision 102:917b0f7d3576, 1.8 kB (checked in by Tarek Ziadé <tarek@…>, 15 months ago)

==========
atomisator
==========

This program takes feeds and injects them into a database
after filtering them.

The first step is to read data::

    >>> from sources import get_entries
    >>> sources = (('rss', rss_file),)
    >>> feeds = get_entries(sources)
    >>> sample_feed = feeds.next()


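The implementation of ``get_entries`` is not shown in this document. As a rough illustration of what reading a feed involves, the sketch below parses an RSS 2.0 document using only the standard library. It is written for Python 3, and the ``parse_rss`` helper, the sample feed and the dictionary keys are assumptions made for this example, not part of atomisator:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so that the sketch is self-contained.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First entry</title>
      <link>http://example.com/1</link>
      <description>Some &lt;b&gt;text&lt;/b&gt;</description>
    </item>
    <item>
      <title>Second entry</title>
      <link>http://example.com/2</link>
      <description>More text</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Yield one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        }


entries = list(parse_rss(SAMPLE_RSS))
```

Note that the summary of the first entry still contains HTML tags after parsing, which is the kind of content a later filter would have to clean up.
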
The next step is to connect to the data stored in the database::

    >>> sqluri = 'sqlite:///tests/test.db'
    >>> from filters import bayes
    >>> bayes.SQLURI = sqluri

    >>> from entries import Entries
    >>> entries = Entries(sqluri)

Then run the filters on the feed to keep only the qualified entries::

    >>> from filters import run_filters
    >>> qualified = [entry for entry in sample_feed
    ...              if run_filters(entry, entries)]

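How ``run_filters`` works internally is not described here. One plausible design, sketched below for Python 3, is a chain in which every filter receives the entry and the stored entries, and returns either the (possibly modified) entry or ``None`` to reject it. The names ``run_chain``, ``strip_whitespace`` and ``reject_duplicates`` are hypothetical and only mirror the ``(entry, entries)`` call signature used in this document:

```python
def strip_whitespace(entry, stored):
    """Normalise whitespace in the title; never rejects an entry."""
    entry["title"] = " ".join(entry["title"].split())
    return entry


def reject_duplicates(entry, stored):
    """Reject an entry whose title is already stored."""
    if any(entry["title"] == other["title"] for other in stored):
        return None
    return entry


def run_chain(entry, stored, filters):
    """Apply each filter in order; stop as soon as one rejects the entry."""
    for filter_ in filters:
        entry = filter_(entry, stored)
        if entry is None:
            return None
    return entry


stored = [{"title": "Already seen"}]
feed = [{"title": "Already   seen"}, {"title": "Brand new"}]
filters = [strip_whitespace, reject_duplicates]
qualified = [entry for entry in feed if run_chain(entry, stored, filters)]
```

The order of the filters matters in such a design: the duplicate check only catches the first entry because its title was normalised beforehand.
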
The qualified entries can then be injected into the database::

    >>> entries.insert_entries(qualified)

Then produce an output::

    >>> from outputs import write_output
    >>> print write_output('summary', entries).encode('utf8')
    2007-05-30: Xavier Darcos programme la finde la carte scolaire
    ...

The cool thing about atomisator is the filtering. The default filters
run over the entries are:

- Levenshtein, to reject entries that are too similar to existing ones,
  even if they differ a little bit;
- HTML remover, to avoid storing HTML tags in the database;
- Bayesian classifier, to decide whether an entry is interesting.

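To make the first two filters more concrete, here is a minimal Python 3 sketch of a Levenshtein distance and of a naive HTML remover. It is not atomisator's code: the function names, the relative threshold of ``0.2`` and the regular expression used to drop tags are all assumptions chosen for illustration:

```python
import re
from html import unescape


def levenshtein(a, b):
    """Return the edit distance between the strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_similar(a, b, threshold=0.2):
    """True when the distance is small relative to the longer string."""
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest <= threshold


def strip_html(text):
    """Remove tags with a naive regular expression, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))
```

A regular expression is enough for a sketch, but it is easily fooled by malformed markup; a real filter would more likely rely on an HTML parser.
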
The classifier can be trained, once the HTML has been removed from the
entries::

    >>> from filters.bayes import bayesian_learn, bayesian
    >>> from filters.unhtml import descape
    >>> entries = list(get_entries(sources).next())
    >>> for entry in entries:
    ...     res = descape(entry, None)
    ...
    >>> bayesian_learn(entries[0], sqluri=None, answer='y')
    >>> bayesian_learn(entries[1], sqluri=None, answer='n')

It can then be used to guess whether an entry is interesting::

    >>> bayesian(entries[0], None, sqluri=None)
    True
    >>> bayesian(entries[1], None, sqluri=None)
    False
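
The train-then-guess cycle above can be illustrated with a small word-based naive Bayes classifier. This is a sketch for Python 3 under stated assumptions, not the code behind ``bayesian_learn`` and ``bayesian``: the ``TinyBayes`` class, its tokenizer and the add-one smoothing are choices made for this example, and only the ``'y'`` and ``'n'`` answers are taken from the document:

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Two-class naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.words = {"y": Counter(), "n": Counter()}
        self.docs = Counter()

    @staticmethod
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def learn(self, text, answer):
        """Record a text as interesting ('y') or not ('n')."""
        self.docs[answer] += 1
        self.words[answer].update(self.tokenize(text))

    def score(self, text, answer):
        """Log of P(answer) times the product of P(word | answer)."""
        counts = self.words[answer]
        vocabulary = len(set(self.words["y"]) | set(self.words["n"]))
        # Guard against division by zero on an untrained classifier.
        denominator = (sum(counts.values()) + vocabulary) or 1
        # The prior is smoothed too, so an unseen class never yields log(0).
        result = math.log(
            (self.docs[answer] + 1) / (sum(self.docs.values()) + 2))
        for word in self.tokenize(text):
            result += math.log((counts[word] + 1) / denominator)
        return result

    def guess(self, text):
        """True when the text looks more like 'y' than like 'n'."""
        return self.score(text, "y") > self.score(text, "n")


classifier = TinyBayes()
classifier.learn("python packaging release announced", "y")
classifier.learn("celebrity gossip and football scores", "n")
```

With this training data, ``classifier.guess("new python release")`` leans towards ``'y'`` and ``classifier.guess("football gossip")`` towards ``'n'``. Two training texts are of course far too few for a useful classifier; the point is only to show the mechanics.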