| 1 | ==================================================== |
|---|
| 2 | Creating a local copy of a website with tagged words |
|---|
| 3 | ==================================================== |
|---|
| 4 | |
|---|
| 5 | :abstract: |
|---|
| 6 | |
|---|
| 7 | This tutorial explains how to create a local copy of web pages |
|---|
| 8 | with tagged words. There are no particular prerequisites for this tutorial. |
|---|
| 9 | |
|---|
| 10 | .. contents:: Toc |
|---|
| 11 | |
|---|
| 12 | The job is done in three parts: |
|---|
| 13 | |
|---|
| 14 | - reading the configuration file |
|---|
| 15 | - getting the content |
|---|
| 16 | - extracting the words to tag and writing the files |
|---|
| 17 | |
|---|
| 18 | Reading the configuration file |
|---|
| 19 | ============================== |
|---|
| 20 | |
|---|
| 21 | Python provides a module that knows how to read *ini* files, which are |
|---|
| 22 | organized in sections. Such a file can look like this:: |
|---|
| 23 | |
|---|
| 24 | [webpages] |
|---|
| 25 | python=http://python.org |
|---|
| 26 | pycon=http://us.pycon.org |
|---|
| 27 | |
|---|
| 28 | [options] |
|---|
| 29 | path=/tmp |
|---|
| 30 | |
|---|
| 31 | [words] |
|---|
| 32 | cool |
|---|
| 33 | fun |
|---|
| 34 | neat |
|---|
| 35 | |
|---|
| 36 | `ConfigParser` knows how to handle these files and provides simple |
|---|
| 37 | access to their sections and values. |
|---|
| 38 | |
|---|
| 39 | -> learn `how to use ConfigParser` |
|---|
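
As a minimal sketch of this step, the snippet below parses the sample file shown above. It assumes Python 3, where the `ConfigParser` module is named `configparser`. The helper name `read_config` and the inline `SAMPLE` string are illustrative choices, not part of the original program.

```python
import configparser

# Sample configuration mirroring the file shown above.
SAMPLE = """\
[webpages]
python=http://python.org
pycon=http://us.pycon.org

[options]
path=/tmp

[words]
cool
fun
neat
"""


def read_config(text):
    """Parse ini text and return (pages, path, words). Hypothetical helper."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # the [words] entries have no "=value" part
        interpolation=None,   # keep any "%" found in URLs as-is
    )
    parser.read_string(text)
    pages = dict(parser.items("webpages"))  # page name -> URL
    path = parser.get("options", "path")    # output directory
    words = parser.options("words")         # option names only
    return pages, path, words


pages, path, words = read_config(SAMPLE)
```

Without `allow_no_value=True` the parser rejects the bare entries of the `words` section, so that flag is the one setting this layout depends on.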
| 40 | |
|---|
| 41 | Getting the content |
|---|
| 42 | =================== |
|---|
| 43 | |
|---|
| 44 | The configuration is now read. Let's get the content over the web. `urllib2` |
|---|
| 45 | provides such a service, and lets you download the content of a remote page |
|---|
| 46 | into a workable string. |
|---|
| 47 | |
|---|
| 48 | -> learn `how to use urllib2` |
|---|
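
The download step could look like the sketch below. It assumes Python 3, where the functionality of `urllib2` lives in `urllib.request`. The function name `fetch`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download `url` and return its body as a str. Hypothetical helper."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when no charset is declared (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    # Undecodable bytes are replaced instead of raising an error.
    return raw.decode(charset, errors="replace")


# Usage (needs network access):
# html = fetch("http://python.org")
```

Decoding inside the helper means the rest of the program only ever handles text, which is what the substitution step expects.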
| 49 | |
|---|
| 50 | Extracting the words |
|---|
| 51 | ==================== |
|---|
| 52 | |
|---|
| 53 | The workable string can be processed with a regular expression matcher, which |
|---|
| 54 | lets you substitute a matching string with a value. In our case, the expressions |
|---|
| 55 | are the words listed in the `words` section of the file. |
|---|
| 56 | |
|---|
| 57 | -> learn `how to substitute a value in a string with *re*` |
|---|
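
One way to sketch the substitution is shown below. The tutorial does not say what a tag looks like, so wrapping each word in `<b>` is an assumption, as are the helper name `tag_words`, the whole-word matching, and the case-insensitive comparison.

```python
import re


def tag_words(text, words):
    """Wrap each whole-word occurrence of `words` in <b> tags. Hypothetical helper."""
    if not words:
        return text
    # re.escape protects special characters; \b limits matches to whole words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    # \1 re-inserts the matched word, preserving its original casing.
    return pattern.sub(r"<b>\1</b>", text)


tagged = tag_words("Python is cool and fun, not uncool.", ["cool", "fun", "neat"])
```

Building a single alternation pattern processes the text in one pass instead of once per word.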
| 58 | |
|---|
| 59 | Linking everything |
|---|
| 60 | ================== |
|---|
| 61 | |
|---|
| 62 | Our program works in four phases: |
|---|
| 63 | |
|---|
| 64 | - read the `webpages` section of the configuration file |
|---|
| 65 | - get all the web pages |
|---|
| 66 | - for each of them, replace the words listed in the configuration file with |
|---|
| 67 | a bolded version |
|---|
| 68 | - save them as text files in the given `path`. |
|---|
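
The four phases above can be sketched as a single function. This is an illustration under stated assumptions rather than the tutorial's actual program: it targets Python 3 (`configparser`, `urllib.request`), and the names `mirror` and `fetch`, the `<b>` markup, and the `<name>.txt` output file names are all hypothetical.

```python
import configparser
import os
import re
import urllib.request


def fetch(url):
    """Download `url` and return its body as text (UTF-8 fallback assumed)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def mirror(config_file, fetch=fetch):
    """Run the four phases and return the list of files written."""
    # Phase 1: read the configuration file.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    with open(config_file) as handle:
        parser.read_file(handle)
    pages = parser.items("webpages")
    path = parser.get("options", "path")
    words = parser.options("words")

    # One pattern matching any configured word as a whole word.
    pattern = None
    if words:
        alternatives = "|".join(re.escape(word) for word in words)
        pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    written = []
    for name, url in pages:
        content = fetch(url)  # Phase 2: get the page.
        if pattern is not None:
            content = pattern.sub(r"<b>\1</b>", content)  # Phase 3: tag words.
        target = os.path.join(path, name + ".txt")  # Phase 4: save the result.
        with open(target, "w", encoding="utf-8") as out:
            out.write(content)
        written.append(target)
    return written


# Usage (needs network access and an existing configuration file):
# mirror("site.ini")
```

Passing `fetch` as a parameter is a design choice that lets the pipeline be exercised with a stub function instead of real network calls.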
| 69 | |
|---|
| 70 | |
|---|
| 71 | |
|---|