=======
indexer
=======

The indexer provides:

- client-side modules: an API that lets clients request indexing and query
  the Xapian database. Each indexing request is stored in an SQL database;

- server-side application: a standalone thread that reads the SQL database
  and indexes what has been requested.


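The split described above (clients record requests in SQL, a separate thread consumes them) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `IndexQueue`, `process_one`, `worker` and the `pending` table are hypothetical names, a plain dict stands in for the Xapian database, and nothing here is taken from the package's actual implementation.

```python
import sqlite3
import threading
import time


class IndexQueue:
    """Hypothetical SQL-backed queue of pending indexing requests."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False lets the worker thread share the
        # connection; the lock serialises every access to it.
        self._db = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS pending "
                "(id INTEGER PRIMARY KEY AUTOINCREMENT, uid TEXT, body TEXT)"
            )
            self._db.commit()

    def request(self, uid, body):
        """Client side: record the request and return immediately."""
        with self._lock:
            self._db.execute(
                "INSERT INTO pending (uid, body) VALUES (?, ?)", (uid, body)
            )
            self._db.commit()

    def is_working(self):
        """True while at least one request is still pending."""
        with self._lock:
            (count,) = self._db.execute(
                "SELECT COUNT(*) FROM pending"
            ).fetchone()
        return count > 0

    def process_one(self, index):
        """Server side: index the oldest request, then remove it."""
        with self._lock:
            row = self._db.execute(
                "SELECT id, uid, body FROM pending ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return False
            row_id, uid, body = row
            # A dict of sorted terms stands in for the real index.
            index[uid] = sorted(set(body.split()))
            self._db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            self._db.commit()
        return True


def worker(queue, index, stop):
    """Standalone loop: drain the queue until asked to stop."""
    while not stop.is_set():
        if not queue.process_one(index):
            time.sleep(0.01)


queue, index, stop = IndexQueue(), {}, threading.Event()
queue.request("1", "my taylor is not rich anymore")
queue.request("2", "pluto is a dog")

thread = threading.Thread(target=worker, args=(queue, index, stop))
thread.start()
while queue.is_working():
    time.sleep(0.01)
stop.set()
thread.join()

print(sorted(index))  # ['1', '2']
```

Because a request is deleted only after it has been indexed, an empty table means that all the work is done, which is what makes a simple polling check sufficient for the client.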
Let's import the modules used by the client-side::

    >>> import indexer
    >>> import searcher

Let's reset the SQL DB first::

    >>> indexer.reset()

Let's also reset the Xapian DB::

    >>> from xapindexer import force_reset
    >>> force_reset()

The Xapian DB should be empty now::

    >>> searcher.corpus_size()
    0

Indexing
========

Each indexable item has a unique id and a text to index::

    >>> uid = '1'
    >>> text = 'my taylor is not rich anymore'

Let's index it::

    >>> indexer.index_document(uid, text)

Another one::

    >>> indexer.index_document('2', 'pluto is a dog')

Let's start the worker that is in charge of asynchronous indexing::

    >>> from xapindexer import start_server
    >>> start_server()

Let's wait a bit so the worker has time to read the SQL database
and do the work::

    >>> import time
    >>> while indexer.is_working():
    ...     time.sleep(0.2)

`is_working` checks the SQL DB to see whether any work is left.

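The polling loop above never gives up if the worker is not running. A minimal sketch of a bounded wait follows; `wait_until_idle` and its `timeout` and `interval` parameters are hypothetical helpers, not part of the package, and `fake_is_working` merely stands in for `indexer.is_working`.

```python
import time


def wait_until_idle(is_working, timeout=5.0, interval=0.2):
    """Poll `is_working` until it returns False or `timeout` elapses.

    Returns True if the work finished, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while is_working():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


# Stand-in for indexer.is_working: reports work for the first three polls.
remaining = [True, True, True]


def fake_is_working():
    return remaining.pop() if remaining else False


print(wait_until_idle(fake_is_working, timeout=1.0, interval=0.01))  # True
print(wait_until_idle(lambda: True, timeout=0.05, interval=0.01))    # False
```

In a test suite, a `False` return value can be turned into an explicit failure instead of a hang.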
The Xapian DB has two documents now::

    >>> searcher.corpus_size()
    2

You can also have the library pre-process the text before indexing by
applying its stemming algorithm. To do so, pass the ISO language code of
the text as the `language` argument::

    >>> uid = '3'
    >>> text = "Stemming is the process for reducing inflected (or sometimes"\
    ...        " derived) words to their stem, base or root form."
    >>> indexer.index_document(uid, text, language='en')

We can also try a French sentence with some accents::

    >>> uid = '4'
    >>> text = "La lexémisation d'un mot est la fonction qui associe un"\
    ...        " lexème à celui-ci."
    >>> indexer.index_document(uid, text, language='fr')

Let's wait a bit so the worker has time to read the SQL database
and do the work::

    >>> import time
    >>> while indexer.is_working():
    ...     time.sleep(0.2)

Searching
=========

Let's search now with `searcher`. The default operator is AND::

    >>> res = searcher.search('rich')
    >>> list(res)
    ['1']
    >>> res = searcher.search('pluto')
    >>> list(res)
    ['2']
    >>> res = searcher.search('dog')
    >>> list(res)
    ['2']
    >>> res = searcher.search('rich dog')
    >>> list(res)
    []

With the OR operator::

    >>> res = searcher.search('rich dog', or_=True)
    >>> res = list(res)
    >>> res.sort()
    >>> res
    ['1', '2']

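The difference between the two operators comes down to set intersection versus set union over the documents that contain each term. The sketch below illustrates this with a toy in-memory inverted index; `build_index` and this `search` function are hypothetical and say nothing about how Xapian or `searcher` evaluates queries internally.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    postings = defaultdict(set)
    for uid, text in documents.items():
        for term in text.lower().split():
            postings[term].add(uid)
    return postings


def search(postings, query, or_=False):
    """Return sorted ids matching all terms (AND) or any term (OR)."""
    term_sets = [postings.get(term, set()) for term in query.lower().split()]
    if not term_sets:
        return []
    # AND keeps ids present in every set, OR keeps ids present in any set.
    combine = set.union if or_ else set.intersection
    return sorted(combine(*term_sets))


postings = build_index({
    "1": "my taylor is not rich anymore",
    "2": "pluto is a dog",
})
print(search(postings, "rich dog"))            # []
print(search(postings, "rich dog", or_=True))  # ['1', '2']
```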
As with indexing, stemming can be applied when searching. For example, a
search for the word `reducer` is reduced to the stem `reduc`, just like the
word `reducing` in document 3::

    >>> res = searcher.search('reducer', language='en')
    >>> list(res)
    ['3']

In French::

    >>> res = searcher.search('lexemiser', language='fr')
    >>> list(res)
    ['4']

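To see why a query for `reducer` can match a document that contains `reducing`, here is a toy suffix-stripping stemmer applied to both the indexed text and the query. It is deliberately naive: the `SUFFIXES` list and `toy_stem` are invented for illustration and are not the stemming rules the library actually applies.

```python
# Illustrative suffix list only; real stemmers use far more careful rules.
SUFFIXES = ("ing", "er", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Stem every token, as would be done at indexing time."""
    return {toy_stem(token) for token in text.split()}


indexed = index_terms("the process for reducing inflected words")
print(toy_stem("reducer"), toy_stem("reducing"))  # reduc reduc
print(toy_stem("reducer") in indexed)             # True
```

The important point is that the same stemming step runs at indexing time and at query time, so both sides agree on the stored form of a word.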
We have an API to detect if a document is present::

    >>> searcher.document_exists('2')
    True
    >>> searcher.document_exists('ttt')
    False

And another one to retrieve indexed terms::

    >>> list(searcher.document_terms('2'))
    ['dog', 'is', 'pluto']

Reindexing
==========

A document can also be reindexed::

    >>> indexer.index_document('2', 'pluto is a cat')
    >>> indexer.work_in_process()
    ([u'2'], [])

Let's wait a bit::

    >>> while indexer.is_working():
    ...     time.sleep(0.2)

Let's make sure the document has been reindexed::

    >>> list(searcher.document_terms('2'))
    ['cat', 'is', 'pluto']

Then check that the search results have changed::

    >>> res = searcher.search('rich dog', or_=True)
    >>> list(res)
    ['1']

Or deleted::

    >>> res = searcher.search('pluto')
    >>> list(res)
    ['2']
    >>> indexer.delete_document('2')
    >>> while indexer.is_working():
    ...     time.sleep(0.2)
    >>> res = searcher.search('pluto')
    >>> list(res)
    []

Statistics
==========

We can also do a bit of statistics::

    >>> #import stats
    >>> #stats.query_count('pluto') > 1
    True

We can also provide search suggestions. Let's run a few searches first::

    >>> searcher.search('platon')
    <generator object at 0x...>
    >>> searcher.search('plisser')
    <generator object at 0x...>

    >>> #list(stats.query_suggestions('pl'))
    [(u'pluto', ...), (u'platon', ...), (u'plisser', ...)]

This is useful, for example, to provide an Ajax search box where
suggestions are displayed as the user types.
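Prefix suggestions of this kind can be derived from a log of past queries. The following is a minimal sketch, assuming suggestions are ranked by how often each query was seen; `QueryLog` and its methods are hypothetical and do not describe how the `stats` module is implemented.

```python
from collections import Counter


class QueryLog:
    """Counts past queries and suggests those starting with a prefix."""

    def __init__(self):
        self._counts = Counter()

    def record(self, query):
        """Register one occurrence of a query."""
        self._counts[query.strip().lower()] += 1

    def query_count(self, query):
        """Return how many times a query has been recorded."""
        return self._counts[query.strip().lower()]

    def suggestions(self, prefix, limit=5):
        """Return (query, count) pairs, most frequent first."""
        prefix = prefix.strip().lower()
        matches = [
            (query, count)
            for query, count in self._counts.items()
            if query.startswith(prefix)
        ]
        # Sort by descending count, then alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return matches[:limit]


log = QueryLog()
for query in ["pluto", "pluto", "pluto", "platon", "plisser", "rich"]:
    log.record(query)

print(log.suggestions("pl"))
# [('pluto', 3), ('platon', 1), ('plisser', 1)]
```

A search box would call `suggestions` with the text typed so far and display the returned queries.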