Revision 57:efd3c5bd5313, 1.8 kB (checked in by Tarek Ziadé <tarek@…>, 21 months ago)

====================================================
Creating a local copy of a website with tagged words
====================================================

:abstract:

  This tutorial explains how to create a local copy of web pages,
  with tagged words. There are no particular prerequisites for this tutorial.

.. contents:: Toc

The job is done in three parts:

- reading the configuration file
- getting the content
- extracting the words to tag and writing the files

Reading the configuration file
==============================

Python provides a module that knows how to read *ini* files, which are
organized in sections. Such a file can look like this::

    [webpages]
    python=http://python.org
    pycon=http://us.pycon.org

    [options]
    path=/tmp

    [words]
    cool
    fun
    neat

`ConfigParser` knows how to handle these files and provides simple access
to their contents.

-> learn `how to use ConfigParser`
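
As a minimal sketch of this step, the example below parses the sample file
from a string. It is written for Python 3, where the module is named
`configparser`; the `read_config` helper and the `SAMPLE` constant are
illustrative names, not part of the tutorial. Two options are assumptions made
for this file layout: `allow_no_value=True` lets the parser accept the bare
keys of the `words` section, and `interpolation=None` keeps any `%` characters
in URLs untouched.

```python
import configparser  # named ConfigParser in Python 2

# The sample configuration from the text above, kept in a string for the demo.
SAMPLE = """\
[webpages]
python=http://python.org
pycon=http://us.pycon.org

[options]
path=/tmp

[words]
cool
fun
neat
"""


def read_config(text):
    """Parse ini-style text and return (pages, path, words)."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # the [words] keys have no "=value" part
        interpolation=None,    # do not treat "%" in URLs as special
    )
    parser.read_string(text)
    pages = dict(parser.items("webpages"))   # name -> URL
    path = parser.get("options", "path")     # output directory
    words = parser.options("words")          # list of words to tag
    return pages, path, words


pages, path, words = read_config(SAMPLE)
```

To read from an actual file instead of a string, `parser.read(filename)` can
replace the `read_string` call.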

Getting the content
===================

The configuration is now read. Let's get the content over the web. `urllib2`
provides such a service, and lets you download the content of a remote page
into a workable string.

-> learn `how to use urllib2`
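
The download step might look like the sketch below. It targets Python 3, where
the functionality of `urllib2` lives in `urllib.request`. The `fetch_page`
name, the ten-second timeout, and the UTF-8 fallback for pages that do not
declare a charset are all assumptions made for illustration.

```python
import urllib.request  # Python 3 counterpart of urllib2


def fetch_page(url, timeout=10):
    """Download `url` and return its content as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset announced in the response headers;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes rather than failing on a bad page.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_page("http://python.org")` would return the page source as a
single string, ready for the word-tagging step.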

Extracting the words
====================

The workable string can be processed with a regular expression matcher, which
lets you substitute a matching string with a value. In our case, the expression
would match each word listed in the `words` section of the file.

-> learn `how to substitute a value in a string with *re*`
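
One possible way to do the substitution is sketched below. The tutorial does
not say how a word is marked as tagged, so wrapping it in `<b>` and `</b>` is
an assumption, as are the `tag_words` name, the case-insensitive matching, and
the whole-word restriction.

```python
import re


def tag_words(text, words, open_tag="<b>", close_tag="</b>"):
    """Return `text` with every listed word wrapped in the given tags."""
    if not words:
        return text
    # Build one alternation such as \b(cool|fun|neat)\b; re.escape guards
    # against words that contain regular expression metacharacters.
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    # A function is used as the replacement so the original casing is kept.
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, text)
```

With the words `cool` and `fun`, the string `Python is cool and Fun, not
uncool.` becomes `Python is <b>cool</b> and <b>Fun</b>, not uncool.` Note that
this plain-text approach would also tag words that appear inside HTML markup.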

Linking everything
==================

Our program works in four phases:

- read the `webpages` section of the configuration file
- get all the web pages
- for each of them, replace the words listed in the configuration file with
  a bolded version
- save them as text files in the given `path`.
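
The phases above can be sketched as one small Python 3 program. Everything
named here is hypothetical: the `make_local_copy`, `fetch_page` and
`tag_words` helpers, the `<b>` markup used for bolding, and the choice to name
each output file after its key in the `webpages` section with a `.txt`
suffix. The `fetch` parameter exists only so the download can be swapped out,
for instance when trying the program without network access.

```python
import configparser
import os
import re
import urllib.request


def fetch_page(url):
    """Download `url` and return its content as a string."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def tag_words(text, words):
    """Wrap each listed word in <b>...</b> (assumed meaning of 'bolded')."""
    if not words:
        return text
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: "<b>" + m.group(1) + "</b>", text)


def make_local_copy(config_file, fetch=fetch_page):
    """Run the four phases and return the list of files written."""
    # Phase 1: read the configuration file.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    with open(config_file, encoding="utf-8") as handle:
        parser.read_file(handle)
    path = parser.get("options", "path")
    words = parser.options("words")

    written = []
    for name, url in parser.items("webpages"):
        # Phase 2: get the page. Phase 3: tag the configured words.
        content = tag_words(fetch(url), words)
        # Phase 4: save the result in the configured directory.
        target = os.path.join(path, name + ".txt")
        with open(target, "w", encoding="utf-8") as out:
            out.write(content)
        written.append(target)
    return written
```

Calling `make_local_copy("site.ini")`, where `site.ini` is a placeholder name
for a file laid out like the sample configuration, would write one tagged
text file per entry of the `webpages` section.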