One of the things I learned from our January code sprint is that DrProject's existing wiki parser has reached, or passed, the point of no return. What started off in Trac as an almost-comprehensible list of regular expressions has grown into a tangle of special cases so sensitive that any change anywhere breaks something somewhere else.
My first
thought was to write a new parser for the same syntax using ANTLR. However, my
minions students have convinced me that switching to
some other syntax---something that's already widely used---would be a
better idea. After a bit of browsing, we settled on Markdown, for
which there are two working parsers for it in Python: Markdown-1,
written by Manfred Stienstra, Yuri Takhteyev, and Waylan Limberg, and Markdown-2,
written by ActiveState's Trent Mick.
So how should I go about choosing which one to use? The first step is to list their strengths. Markdown-1 was written from scratch in Python; Markdown-2 is a fairly direct translation into Python of the Perl Markdown parser. (For example, some of the comments still use Perl notation for regular expressions.) Markdown-1 has been written by three people (lots of eyeballs are good); Markdown-2 by one (a single style is also good). They're similar in length (roughly 1900 and 1680 lines respectively); Markdown-2 claims to be faster and better tested, but no figures are provided on its web site.
Step two: contact the authors. Are both projects still alive, or has either been abandoned? What are their plans for future development? Would they be interested in helping us integrate their code into DrProject? And what do they think of the other project?
Answers came back pretty quickly, and were encouraging: both projects are very much alive. That left step three: read the code. I'd like to believe that we'll be able to treat whichever one we choose as a black box, but past experience tells me that sooner or later (probably sooner), we'll have to dive in and either fix bugs or rearrange things to make it possible to add some extension we absolutely, positively need. I actually did this reading as I was talking via email with the two libraries' authors; they cleared up some questions I had, and Trent even went so far as to make and check in a change that I suggested.
After spending about an hour and a half with each library over the course of a couple of days, I decided to go with Markdown-1. Markdown-2 performance is definitely faster:
$ python perf.py -r 10 tmp-test-cases Time conversion of tmp-test-cases/*.text (plat=darwin): Markdown.pl: best of 10: 0.372s markdown.py: best of 10: 0.295s markdown2.py: best of 10: 0.221s $ python perf.py tmp-aspn-cases Time conversion of tmp-aspn-cases/*.text (plat=darwin): Markdown.pl: best of 3: 5.842s markdown.py: best of 3: 18.920s markdown2.py: best of 3: 4.956s
but (a) our pages tend to be fairly short, and (b) I'm confident Markdown-1 can be tuned. What made me decide in its favor was that it goes from wiki text to DOM to HTML, while Markdown-2 goes straight from wiki text to HTML. This will probably make any structural extensions we want to add simpler to write---while we could always translate to HTML, then convert that to DOM, mess with the structure, and convert back to HTML, hooking into a text->DOM->HTML pipeline seems cleaner. I also found Markdown-1 easier to understand, which is undoubtedly more of a reflection on Markdown-2's Perl antecedents than it is on Trent's programming style ;-).
Now it's time for step 4: re-read the code in detail, and make sure that I understand it well enough to write the extensions we need (for example, one to turn the string "#123" into a link to a ticket). This is also a chance for me to check that I made the right decision: if anything particularly ugly turns up on closer reading, I can still go back to Markdown-2. Finally, I started to talk about code reviews to my undergraduate software engineering class this week; taking notes as I go through Markdown-1 will give me a worked example to put in front of them.
After a little reformatting, the source is 1858 lines long. I didn't read it from start to finish; instead, I jumped around, trying to build a mental model of how the pieces fit together. What follows is a reconstruction of that exploration; I'd welcome comments about things I may have missed, or how else you might have read the code.
The first thing I read is the header. The docstring tells me how to run the program as a standalone application, and where to look for more information:
0001: #!/usr/bin/env python 0002: 0003: version = "1.7" 0004: version_info = (1,7,0,"rc-1") 0005: __revision__ = "$Rev: 66 $" 0006: 0007: """ 0008: Python-Markdown 0009: =============== 0010: 0011: Converts Markdown to HTML. Basic usage as a module: 0012: 0013: import markdown 0014: md = Markdown() 0015: html = md.convert(your_text_string) 0016: 0017: See http://www.freewisdom.org/projects/python-markdown/ for more 0018: information and instructions on how to extend the functionality of the 0019: script. (You might want to read that before you try modifying this 0020: file.) 0021: 0022: Started by [Manfred Stienstra](http://www.dwerg.net/). Continued and 0023: maintained by [Yuri Takhteyev](http://www.freewisdom.org) and [Waylan 0024: Limberg](http://achinghead.com/). 0025: 0026: Contact: yuri [at] freewisdom.org 0027: waylan [at] gmail.com 0028: 0029: License: GPL 2 (http://www.gnu.org/copyleft/gpl.html) or BSD 0030: 0031: """
From there, I jump to the very end of the file:
1853: if __name__ == '__main__': 1854: """ Run Markdown from the command line. """ 1855: options = parse_options() 1856: if not options: 1857: sys.exit(0) 1858: markdownFromFile(**options)
This is a standard Python idiom: if your __name__
variable has been set to '__main__', you're the main
program, so parse any command-line arguments you've been given and
then execute them. Execution is handled by
markdownFromFile; reading that will probably take me into
the actual parsing, so I'll put it off for now. Options are handled
by parse_options, which looks like this:
1787: OPTPARSE_WARNING = """
1788: Python 2.3 or higher required for advanced command line options.
1789: For lower versions of Python use:
1790:
1791: %s INPUT_FILE > OUTPUT_FILE
1792:
1793: """ % EXECUTABLE_NAME_FOR_USAGE
1794:
1795: def parse_options():
1796:
1797: try:
1798: optparse = __import__("optparse")
1799: except:
1800: if len(sys.argv) == 2:
1801: return {'input': sys.argv[1],
1802: 'output': None,
1803: 'message_threshold': CRITICAL,
1804: 'safe': False,
1805: 'extensions': [],
1806: 'encoding': None }
1807:
1808: else:
1809: print OPTPARSE_WARNING
1810: return None
1811:
1812: parser = optparse.OptionParser(usage="%prog INPUTFILE [options]")
1813:
1814: parser.add_option("-f", "--file", dest="filename",
1815: help="write output to OUTPUT_FILE",
1816: metavar="OUTPUT_FILE")
1817: parser.add_option("-e", "--encoding", dest="encoding",
1818: help="encoding for input and output files",)
1819: parser.add_option("-q", "--quiet", default = CRITICAL,
1820: action="store_const", const=60, dest="verbose",
1821: help="suppress all messages")
1822: parser.add_option("-v", "--verbose",
1823: action="store_const", const=INFO, dest="verbose",
1824: help="print info messages")
1825: parser.add_option("-s", "--safe", dest="safe", default=False,
1826: metavar="SAFE_MODE",
1827: help="safe mode ('replace', 'remove' or 'escape' user's HTML tag)")
1828:
1829: parser.add_option("--noisy",
1830: action="store_const", const=DEBUG, dest="verbose",
1831: help="print debug messages")
1832: parser.add_option("-x", "--extension", action="append", dest="extensions",
1833: help = "load extension EXTENSION", metavar="EXTENSION")
1834:
1835: (options, args) = parser.parse_args()
1836:
1837: if not len(args) == 1:
1838: parser.print_help()
1839: return None
1840: else:
1841: input_file = args[0]
1842:
1843: if not options.extensions:
1844: options.extensions = []
1845:
1846: return {'input': input_file,
1847: 'output': options.filename,
1848: 'message_threshold': options.verbose,
1849: 'safe': options.safe,
1850: 'extensions': options.extensions,
1851: 'encoding': options.encoding }
The first thing is to find EXECUTABLE_NAME_FOR_USAGE,
which is back up at the top of the file:
0088: # The following constant specifies the name used in the usage 0089: # statement displayed for python versions lower than 2.3. (With 0090: # python2.3 and higher the usage statement is generated by optparse 0091: # and uses the actual name of the executable called.) 0092: 0093: EXECUTABLE_NAME_FOR_USAGE = "python markdown.py"
That's clear enough, and it tells me that the authors have taken
some time to think about compatibility with older versions of Python.
What about that call to __import__ on line 1798? It seems that Markdown-1 only imports the
option parsing library when it needs to, but why not just use an
import statement? Why use the functional form? Hm...
Better ask the authors...
(Note: this bit of code highlights one of the big differences
between C++ and Java on the one hand, and Python and Ruby on the
other. In Java, import is a declaration about the
program: it tells the compiler "I'm going to link against the
following." In Python, it's a command, meaning, "Load and compile the
following file right now." You can actually import code dynamically
in Java, and even in some implementations of C++ (if the author of the
code being imported has laid the groundwork), but it's a lot harder.
On the other hand, by making promises to the compiler, you're giving
it a chance to do optimizations that so far aren't possible for more
dynamic languages.)
Next, I take a closer look at what the options parsing function returns:
1846: return {'input': input_file,
1847: 'output': options.filename,
1848: 'message_threshold': options.verbose,
1849: 'safe': options.safe,
1850: 'extensions': options.extensions,
1851: 'encoding': options.encoding }
This is another common idiom: since the options can all be
represented as key/value pairs, the function just stuffs them into a
dictionary. The first two are self-explanatory; the third is... hang
on... yes, it's used in markdownFromFile to specify how
verbose error reporting and logging should be. No problem: that's
just Python's standard logging library. Oh, but when I search in the
file for message_threshold, I find this near the top of
the file:
0039: MESSAGE_THRESHOLD = CRITICAL
0040:
0041: # Configure debug message logger (the hard way - to support python 2.3)
0042: logger = getLogger('MARKDOWN')
0043: logger.setLevel(DEBUG) # This is restricted by handlers later
0044: console_hndlr = StreamHandler()
0045: formatter = Formatter('%(name)s-%(levelname)s: "%(message)s"')
0046: console_hndlr.setFormatter(formatter)
0047: console_hndlr.setLevel(MESSAGE_THRESHOLD)
0048: logger.addHandler(console_hndlr)
0049:
0050: def message(level, text):
0051: ''' A wrapper method for logging debug messages. '''
0052: logger.log(level, text)
Again, nothing surprising: there's even a comment on line 41 telling me that logging is being initialized "the hard way" for backward compatibility. (I don't know what the easy way would be, but I don't need to care right now.)
The other three values in the options dictionary are called
"safe", "extensions", and
"encoding". Since this is a parser, the latter almost
certainly means "character encoding"; I've read Joel Spolsky's excellent
article about Unicode and
character sets, so I don't expect to have any trouble figuring out
how that's used. (I actually do, but we'll get to that later.) What
about "safe"? Looking back in the body of the function,
I see:
1825: parser.add_option("-s", "--safe", dest="safe", default=False,
1826: metavar="SAFE_MODE",
1827: help="safe mode ('replace', 'remove' or 'escape' user's HTML tag)")
"Replace, remove, or escape user's HTML tag" is a good hint. The world is full of evil people who enjoy putting link spam in wiki pages; apparently, Markdown-1 gives me the option of taking any raw HTML in the page being processed and either replace it with something (I need to figure out what), remove it entirely, or escape it (i.e., turn every "<" into "<" and so on). Ideally, this choice will be implemented in a single switch statement (in Python, an if/elif/elif/oops block), or via some moral equivalent such as function references or polymorphic methods.
(Note: I say "if/elif/elif/oops" because variant paths should almost always be implemented as:
if case A:
do A
elif case B:
do B
elif case C:
do C
else:
report a fatal error and halt
because both:
if case A:
do A
elif case B:
do B
else:
do C
and:
if case A:
do A
elif case B:
do B
elif case C:
do C
do the wrong thing if a case D is ever added. If there are three choices, the odds are good there will one day be a fourth; plan for it.)
The last option I have to figure out is "extensions",
which is set up like this:
1832: parser.add_option("-x", "--extension", action="append", dest="extensions",
1833: help = "load extension EXTENSION", metavar="EXTENSION")
That isn't much help: I know that Markdown-1 allows user-written
extensions, so all this tells me is that I have to use
"-x" to load one---presumably from a file. I'll have to
search further to figure out what "loading" means, what format
extensions have to be in, and so on.
Back to the top of the file again. I've ticked off the first few lines (literally: it's the only way to keep track of what I still haven't read), and the next four are:
0034: import re, sys, os, random, codecs 0035: 0036: from logging import getLogger, StreamHandler, Formatter, \ 0037: DEBUG, INFO, WARN, ERROR, CRITICAL
Regular expressions, the system library, the OS library for file manipulation... Um, the random number generator? Really? What's that for? And then codecs, to handle different character encodings, and then the logging library for debugging messages, warnings, and so on. But why would a parser need random numbers?
Turns out it doesn't: a search through the file for the word "random" turns up no other hits. What happens if I take it out and run the tests? That wouldn't guarantee that it isn't being used, but it would be strong supporting evidence.
So how do I run the tests? Ah---there's a file called
test-markdown.py in the root directory of the release.
When I run it, it---oh. It gives me this:
bash-3.2$ python test-markdown.py markdown Traceback (most recent call last): File "test-markdown.py", line 351, intestDirectory("tests/markdown-test", measure_time=True) File "test-markdown.py", line 215, in testDirectory htmlDiffFile = codecs.open(htmlDiffFilePath, "w", encoding=encoding) File "c:\Python25\lib\codecs.py", line 791, in open file = __builtin__.open(filename, mode, buffering) IOError: [Errno 2] No such file or directory: '/tmp/markdown-test.html'
Sure enough, it has /tmp hardwired as temporary
directory. I'm running on Windows with Cygwin; I'll just change that
temporarily to ./tmp, make the required directory,
and---yup, no problem, now it runs.
OK, back to that random number module. I take it out of the
import list and re-run the tests---everything's happy.
But finding one unused library makes me suspicious, so I search for
the others, and sure enough, it turns out that the os
module isn't being used either. I take them both out, and make a note
to send a patch to the library's authors. There's still 1800 lines of
code to read, though...
The next few lines in the module are:
0054: # --------------- CONSTANTS YOU MIGHT WANT TO MODIFY ----------------- 0055: 0056: TAB_LENGTH = 4 # expand tabs to this many spaces 0057: ENABLE_ATTRIBUTES = True # @id = xyz -> <... id="xyz"> 0058: SMART_EMPHASIS = 1 # this_or_that does not become this<i>or</i>that 0059: HTML_REMOVED_TEXT = "[HTML_REMOVED]" # text used instead of HTML in safe mode
Cool---symbolic constants. TAB_LENGTH explains
itself; ENABLE_ATTRIBUTES is, um, I'll come back to that.
SMART_EMPHASIS makes sense: underscores can be used to
show emphasis in Markdown, and this controls whether it takes effect
when it's embedded in other text. That just leaves
HTML_REMOVED_TEXT, which must connect back to the
"remove" option for raw HTML. I can check by searching :
0905: class RawHtmlTextPostprocessor(Postprocessor):
...
0910: def run(self, text):
0911: for i in range(self.stash.html_counter):
0912: html, safe = self.stash.rawHtmlBlocks[i]
0913: if self.safeMode and not safe:
0914: if str(self.safeMode).lower() == 'escape':
0915: html = self.escape(html)
0916: elif str(self.safeMode).lower() == 'remove':
0917: html = ''
0918: else:
0919: html = HTML_REMOVED_TEXT
Uh oh: this is exactly the fall-through that I warned about. What
it's saying is, "If the safe mode isn't escape or remove, it must be
replace," but if we ever add a fourth escape mode, that won't be true.
It would be paranoid better to add another
elif, and have the else complain and exit.
And while I'm here: why is the safe mode being converted to lower case
before being checked? Does that mean there's some path through the
code that could make it upper case? Make... another... note...
Back to ENABLE_ATTRIBUTES. It is used in the
toxml method of a class called Element:
0286: def toxml(self): 0287: if ENABLE_ATTRIBUTES: 0288: for child in self.childNodes: 0289: child.handleAttributes()
and in the handleMatch method of a class called
ImagePattern:
0760: def handleMatch(self, m, doc):
0761: el = doc.createElement('img')
0762: src_parts = m.group(9).split()
0763: if src_parts:
0764: el.setAttribute('src', src_parts[0])
0765: else:
0766: el.setAttribute('src', "")
0767: if len(src_parts) > 1:
0768: el.setAttribute('title', dequote(" ".join(src_parts[1:])))
0769: if ENABLE_ATTRIBUTES:
0770: text = doc.createTextNode(m.group(2))
0771: el.appendChild(text)
0772: text.handleAttributes()
0773: truealt = text.value
0774: el.childNodes.remove(text)
0775: else:
0776: truealt = m.group(2)
0777: el.setAttribute('alt', truealt)
0778: return el
I'll ignore the second use for now, by which I mean, I'll circle
the use of ENABLE_ATTRIBUTES, put a question mark beside
it, and worry about it when I get there. Searching for the method
associated with the first
use---handleAttributes---doesn't turn up anything
particularly interesting; there are several definitions of the method,
all but one of which do nothing but pass. The one that
doesn't just calls a callback, and... OK, I'll skip over this for now
as well. Maybe it'll become obvious once I know more about the code.
And maybe then it'll be clear why ENABLE_ATTRIBUTES is
set to 1 (implying an enumeration), even though it appears to be a
Boolean flag. It can't be backward compatibility:
SMART_EMPHASIS, on the line below, is set to
True.
The constants are followed by this rather cryptic block:
0061: RTL_BIDI_RANGES = ( (u'\u0590', u'\u07FF'),
0062: # from Hebrew to Nko (includes Arabic, Syriac and Thaana)
0063: (u'\u2D30', u'\u2D7F'),
0064: # Tifinagh
0065: )
0066:
0067: # Unicode Reference Table:
0068: # 0590-05FF - Hebrew
0069: # 0600-06FF - Arabic
0070: # 0700-074F - Syriac
0071: # 0750-077F - Arabic Supplement
0072: # 0780-07BF - Thaana
0073: # 07C0-07FF - Nko
0074:
0075: BOMS = { 'utf-8': (codecs.BOM_UTF8, ),
0076: 'utf-16': (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
0077: #'utf-32': (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
0078: }
0079:
0080: def removeBOM(text, encoding):
0081: convert = isinstance(text, unicode)
0082: for bom in BOMS[encoding]:
0083: bom = convert and bom.decode(encoding) or bom
0084: if text.startswith(bom):
0085: return text.lstrip(bom)
0086: return text
What's "RTL"? And "BIDI"? The clue is in the names of the character sets: Hebrew and Arabic are both read right-to-left, or RTL. So does "BIDI" mean "bidirectional"? Searching forward finds this:
0134: def getBidiType(text): 0135: 0136: if not text: return None 0137: 0138: ch = text[0] 0139: 0140: if not isinstance(ch, unicode) or not ch.isalpha(): 0141: return None 0142: 0143: else: 0144: 0145: for min, max in RTL_BIDI_RANGES: 0146: if ( ch >= min and ch <= max ): 0147: return "rtl" 0148: else: 0149: return "ltr"
Interesting: given a piece of text, this returns None
if there isn't enough information to determine its reading order, or
one of the two strings "rtl" and "ltr" if
there is. This means the caller has to handle two data types (none
and string) coming back from a single function, which probably means
that there's a default setting which is
only overridden if this function can actually determine an answer.
But that's just a guess; I'll put a question mark on it and come back
later.
Now, what about "BOM"? Google tells me that it stands for "bill of
materials" or "byte order mark". The second sounds more likely: it's
the two-byte Unicode character U+FEFF, more properly called "ZERO
WIDTH NO-BREAK SPACE", which can be put at the start of a file (or any
other text stream) to help readers determine whether the data is
big-endian or little-endian. The Wikipedia
article has the details; suffice to say for now that the only
place the BOMS dictionary is used is in
removeBOM, which is only ever called at the start of
markdownFromFile. I'm therefore going to guess that all
this is doing is getting rid of the first two characters of the input
data if need be.
But: why is line 77 commented out? And what exactly is line 83 doing? I'll have to mail the authors to get an answer to the first; as for the second, the syntax "X and Y or Z" is an old-fashioned way of writing a one-line conditional in Python. If X is true (i.e., not zero, the empty string, or some equivalent value), the expression produces Y; if X is false, it produces Z. Unless Y is false as well, of course, in which case you also get Z. The modern way to write this is "Y if X else Z", but again, the authors want to be compatible with older versions of Python. Anyway, I assume this code has been tested, so the expression must work, so I won't worry about it for now.
Next, another little block:
0095: # --------------- CONSTANTS YOU _SHOULD NOT_ HAVE TO CHANGE ---------- 0096: 0097: # a template for html placeholders 0098: HTML_PLACEHOLDER_PREFIX = "qaodmasdkwaspemas" 0099: HTML_PLACEHOLDER = HTML_PLACEHOLDER_PREFIX + "%dajkqlsmdqpakldnzsdfls" 0100: 0101: BLOCK_LEVEL_ELEMENTS = ['p', 'div', 'blockquote', 'pre', 'table', 0102: 'dl', 'ol', 'ul', 'script', 'noscript', 0103: 'form', 'fieldset', 'iframe', 'math', 'ins', 0104: 'del', 'hr', 'hr/', 'style'] 0105: 0106: def is_block_level (tag): 0107: return ( (tag in BLOCK_LEVEL_ELEMENTS) or 0108: (tag[0] == 'h' and tag[1] in "0123456789") )
I can guess what lines 98--99 are for: whenever something needs to be taken out
of the text temporarily, a garbage string that's very (very) unlikely
to ever appear in a wiki page is inserted. Each garbage string has a
unique counter embedded in it (notice the "%d" format at
the start of the second half of the placeholder string?); that
suggests that whenever text is taken out, it's stored in a dictionary
with the placeholder (or possibly integer) as a key.
The other Markdown library did something similar, but in a safer way. Instead of choosing a fixed string, it constructed an MD5 hash of the text being taken out. I don't know what the odds of "qaodmasdkwaspemas" or "ajkqlsmdqpakldnzsdfls" appearing in a wiki page are, but I do know that MD5 is cryptographically very strong, which means the odds are infinitesimal compared with things like the number of atoms in the universe or the number of yammering idiots with cellphones who seem to think that everyone on the streetcar wants to know exactly how drunk they got at Adi's party on Friday.
Anyway: BLOCK_LEVEL_ELEMENTS pretty much explains
itself. I would use a set rather than a list, particularly if tags
are going to be looked up frequently, and I would insert the various
header tags (h1, h2, and so on) into the
structure to make is_block_level's logic simpler, and oh
yeah, call it isBlockLevel to be consistent with the way
other functions and methods in this code are named, but the important
thing for now is that I understand what it's doing.
The next block in the file is 266 lines long, or 15% of the total. My heart sank when I read the title block:
0110: """ 0111: ====================================================================== 0112: ========================== NANODOM =================================== 0113: ====================================================================== 0114: 0115: The three classes below implement some of the most basic DOM 0116: methods. I use this instead of minidom because I need a simpler 0117: functionality and do not want to require additional libraries. 0118: 0119: Importantly, NanoDom does not do normalization, which is what we 0120: want. It also adds extra white space when converting DOM to string 0121: """
Another bunch of classes for representing XML as a tree in Python? xml.dom.minidom has been in the library for years, and ElementTree was added in Python 2.5. Do we really need another? And why is the header block formatted like a docstring? Don't those only work at the start of files, classes, and methods?
To answer the last question first, yes, but you create a string constant anywhere you like, and a lot of people now seem to use triple-quoted string blocks instead of block comments. As for the other questions, I mailed Yuri:
> Why your own little implementation of DOM? There's a comment about > text normalization --- can you please tell me what the problem was? Mostly to remove the dependence on dom, but also because automatically normalizes entities which was getting to be annoying. I.e., markdown allows you to include entities, e.g., ©. If you put them into dom, you will later get &copy; It's possible to work around it, but it was ending up ugly.
That makes sense. It doesn't necessarily convince me---I'd need to work through the unescaping that would be required if minidom or ElementTree were used---but I'll leave it for now. On the bright side, it's 266 lines of code I don't need to read...
...which of course turns out not to be true. There are just enough differences between Nanodom and [name of your favorite XML library goes here] that I do actually have to read the code. I won't include all of the source here, but here are the highlights.
1. two tables of regular expressions called
ENTITY_NORMALIZATION_EXPRESSIONS and
ENTITY_NORMALIZATION_EXPRESSIONS_SOFT. The only difference is the way
ampersands are matched: the first uses "&", while
the second uses "&(?!\#)". The sub-expression
"(?!\#)" means "match only if not followed by a
literal # sign". The two tables are used in the
Document.normalizeEntities method:
0152: class Document:
...
0192: def normalizeEntities(self, text, avoidDoubleNormalizing=False):
0193:
0194: if avoidDoubleNormalizing:
0195: regexps = ENTITY_NORMALIZATION_EXPRESSIONS_SOFT
0196: else:
0197: regexps = ENTITY_NORMALIZATION_EXPRESSIONS
0198:
0199: for regexp, substitution in regexps:
0200: text = regexp.sub(substitution, text)
0201: return text
At this point, I have no idea what the difference is; it's another question to send to the module's authors.
2. Document.__init__ sets self.bidi to
"ltr". Its setBidi method is then:
0163: def setBidi(self, bidi): 0164: if bidi: 0165: self.bidi = bidi
which matches my earlier guess about how bidirectional flags are being handled.
3. Lines 179--182 create a cache of entity references:
0179: def createEntityReference(self, entity): 0180: if entity not in self.entities: 0181: self.entities[entity] = EntityReference(entity) 0182: return self.entities[entity]
The only place this is used is in
AutomailPattern.handleMatch:
0826: class AutomailPattern (Pattern):
0827:
0828: def handleMatch(self, m, doc):
0829: el = doc.createElement('a')
0830: email = m.group(2)
0831: if email.startswith("mailto:"):
0832: email = email[len("mailto:"):]
0833: for letter in email:
0834: entity = doc.createEntityReference("#%d" % ord(letter))
0835: el.appendChild(entity)
0836: mailto = "mailto:" + email
0837: mailto = "".join(['&#%d;' % ord(letter) for letter in mailto])
0838: el.setAttribute('href', mailto)
0839: return el
The first time I read this, I had no idea what it was doing. I
mean, I could tell that it was replacing each character in a
mailto: link with a character references, but I had no
idea why. Then I came across this while looking through the Markdown
syntax spec for something completely different:
Markdown supports a shortcut style for creating "automatic" links for URLs and email addresses: simply surround the URL or email address with angle brackets. What this means is that if you want to show the actual text of a URL or email address, and also have it be a clickable link, you can do this: <http://example.com/> Markdown will turn this into: <a href="http://example.com/">http://example.com/</a> Automatic links for email addresses work similarly, except that Markdown will also perform a bit of randomized decimal and hex entity-encoding to help obscure your address from address-harvesting spambots. For example, Markdown will turn this: <address@example.com> into something like this: <a href="mailto:addre ss@example.co m">address@exa mple.com</a> which will render in a browser as a clickable link to "address@example.com".
All right: I doubt it's particularly effective, but it's another 20-odd lines I now understand.
4. The Element class contains this:
0225: class Element:
...
0238: def setBidi(self, bidi):
0239: if bidi:
0240: orig_bidi = self.bidi
0241: if not self.bidi or self.isDocumentElement:
0242: # Once the bidi is set don't change it (except for doc element)
0243: self.bidi = bidi
0244: self.parent.setBidi(bidi)
In other words, only the BIDI of the root document element can be
changed, and when it is, it changes the BIDI of its parent (i.e., the
document itself). This feels hacky, and I don't know what
orig_bidi is for (a leftover from a debugging
session?).
5. Element.toxml uses the expected recursive
node-to-string-and-concatenate approach. It sometimes inserts an
extra newline (presumably to make the generated HTML more readable),
but there are a couple of things that are more puzzling. Number one
is this, right at the top of the method:
0286: def toxml(self):
0287: if ENABLE_ATTRIBUTES:
0288: for child in self.childNodes:
0289: child.handleAttributes()
...
I still don't understand what this is about; really need to ask the authors. The second is further down: after the children of this node have been processed, there is a block of code that reads:
0311: buffer += "<" + self.nodeName
0312: if self.nodeName in ['p', 'li', 'ul', 'ol',
0313: 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
0314: if not self.attribute_values.has_key("dir"):
0315: if self.bidi:
0316: bidi = self.bidi
0317: else:
0318: bidi = self.doc.bidi
0319: if bidi=="rtl":
0320: self.setAttribute("dir", "rtl")
0321:
0322: for attr in self.attributes:
0323: value = self.attribute_values[attr]
0324: value = self.doc.normalizeEntities(value,
0325: avoidDoubleNormalizing=True)
0326: buffer += ' %s="%s"' % (attr, value)
I've been at this for about an hour now, and I can feel myself
getting tired, so I'm going to draw a big circle around this, put a
question mark beside it, and come back to it later. It's obviously
something to do with bidirectionality---specifying the direction of a
block of text, maybe---but then there's entity normalization and a
call to normalizeEntities that sets
avoidDoubleNormalizing to True to force it
to use the end-limited regular
expression for matching ampersands. Wait a second: the syntax for
a character entity in HTML is &#dddd, where
dddd represents some hexadecimal digits. Does this mean
that the flag and extra table are there to prevent us escaping the
leading ampersand of character entities we've already generated?
6. Then I get to the TextNode class:
0337: class TextNode:
0338:
0339: type = "text"
0340: attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}
0341:
0342: def __init__ (self, text):
0343: self.value = text
0344:
0345: def attributeCallback(self, match):
0346:
0347: self.parent.setAttribute(match.group(1), match.group(2))
0348:
0349: def handleAttributes(self):
0350: self.value = self.attrRegExp.sub(self.attributeCallback, self.value)
0351:
0352: def toxml(self):
0353: text = self.value
0354: self.parent.setBidi(getBidiType(text))
0355: if not text.startswith(HTML_PLACEHOLDER_PREFIX):
0356: if self.parent.nodeName == "p":
0357: text = text.replace("\n", "\n ")
0358: elif (self.parent.nodeName == "li"
0359: and self.parent.childNodes[0]==self):
0360: text = "\n " + text.replace("\n", "\n ")
0361: text = self.doc.normalizeEntities(text)
0362: return text
There's a couple of goodies in this:
{@foo=bar} in a
block of text in Markdown adds foo as an attribute to the
enclosing node with "bar" as a value. Nice hack, and now
that I know it's there, other things I've seen make sense.TextNode.toxml sets the BIDI of the parent node to
whatever's appropriate for this text, but as I've already discovered,
that's ignored unless we're at the root. Hm...Reading the Nanodom code gave me a better idea of how the parser as a whole is going to work, in the same way that the tools someone hands you tells you if you're going to be doing carpentry or brain surgery. With it out of the way, I hit this:
0379: """ 0380: ====================================================================== 0381: ========================== PRE-PROCESSORS ============================ 0382: ====================================================================== 0383: 0384: Preprocessors munge source text before we start doing anything too 0385: complicated. 0386: 0387: Each preprocessor implements a "run" method that takes a pointer to a 0388: list of lines of the document, modifies it as necessary and returns 0389: either the same pointer or a pointer to a new list. Preprocessors 0390: must extend markdown.Preprocessor. 0391: 0392: """ 0392: 0392: 0395: class Preprocessor: 0396: pass
I'm not sure why preprocessors have to extend
Preprocessor---Python uses duck typing,
so anything with a run method that takes and returns a
list of lines ought to be usable---but that's a minor point. What
really matters is that this description of preprocessors neatly
extends the mental model I'm building of the parser's operation:
instead of text->DOM->HTML, it's now
text->lines->DOM->HTML. Take a look at the first
preprocessor:
0399: class HeaderPreprocessor (Preprocessor):
0400: """
0401: Replaces underlined headers with hashed headers to avoid
0402: the nead for lookahead later.
0403: """
0404:
0405: def run (self, lines):
0406:
0407: i = -1
0408: while i+1 < len(lines):
0409: i = i+1
0410: if not lines[i].strip():
0411: continue
0412:
0413: if lines[i].startswith("#"):
0414: lines.insert(i+1, "\n")
0415:
0416: if (i+1 <= len(lines)
0417: and lines[i+1]
0418: and lines[i+1][0] in ['-', '=']):
0419:
0420: underline = lines[i+1].strip()
0421:
0422: if underline == "="*len(underline):
0423: lines[i] = "# " + lines[i].strip()
0424: lines[i+1] = ""
0425: elif underline == "-"*len(underline):
0426: lines[i] = "## " + lines[i].strip()
0427: lines[i+1] = ""
0428:
0429: return lines
0430:
0431: HEADER_PREPROCESSOR = HeaderPreprocessor()
It's immediately obvious what this does---if you've read
the Markdown documentation, and know that it allows h1
headers to be written as either:
| # Title | or | Title ===== |
and h2 headers as:
| ## Title ## | or | Title ----- |
Knowing that, this code is pretty easy to follow: it uses one-line
lookahead to find cases of underlined headers, and turns them into
their hash-prefixed equivalents, so that later processing can treat
lines as units. I'm not sure that starting the index at -1 instead of
the more conventional 0 makes anything simpler, but never
mind---somewhere further down in the file, the singleton instance of
this class that's assigned to HEADER_PREPROCESSOR is
going to take the list of lines and (possibly) shrink it down a
little.
The next preprocessor is a little more complicated. Here's the
class header and run method:
0433: class LinePreprocessor (Preprocessor):
0434: """Deals with HR lines (needs to be done before processing lists)"""
0435:
0436: blockquote_re = re.compile(r'^(> )+')
0437:
0438: def run (self, lines):
0439: for i in range(len(lines)):
0440: prefix = ''
0441: m = self.blockquote_re.search(lines[i])
0442: if m : prefix = m.group(0)
0443: if self._isLine(lines[i][len(prefix):]):
0444: lines[i] = prefix + self.stash.store("<hr />", safe=True)
0445: return lines
The docstring tells me that its job is to handle horizontal
rules---the common hr tag. If that's the case, though,
why does it start by looking for block quotes, which are written in
Markdown like this:
> This bunch of lines will be > indented as a block quote.
I think what's happening is that it's expanding horizontal rulers
in block quotes as well: each one is stored in
self.stash, presumably to be put back later. I don't
understand what ":w" means in <:wHR> on line 448---a typo?---but it's not executable, so I won't
worry about it now.
There is something more important thing in this code that I don't
understand. That variable self.stash---where did it come
from? It's not created in LinePreprocessor.__init__,
because there is no such method. And it isn't in
Preprocessor.__init__, because it doesn't exist either.
What's going on?
It turns out that I have to look ahead almost eight hundred lines to the main class in the program to find it:
1099: class Markdown:
1100: """
1101: Markdown formatter class for creating an html document from Markdown text.
1102: """
...
1203: def reset(self):
1204: """
1205: Resets all state variables so that we can start with a new text.
1206: """
1207: self.references={}
1208: self.htmlStash = HtmlStash()
1209:
1210: HTML_BLOCK_PREPROCESSOR.stash = self.htmlStash
1211: LINE_PREPROCESSOR.stash = self.htmlStash
1212: REFERENCE_PREPROCESSOR.references = self.references
1213: HTML_PATTERN.stash = self.htmlStash
1214: ENTITY_PATTERN.stash = self.htmlStash
1215: REFERENCE_PATTERN.references = self.references
1216: IMAGE_REFERENCE_PATTERN.references = self.references
1217: RAWHTMLTEXTPOSTPROCESSOR.stash = self.htmlStash
1218: RAWHTMLTEXTPOSTPROCESSOR.safeMode = self.safeMode
1219:
1220: for extension in self.registeredExtensions:
1221: extension.reset()
That's legal, but it's not pretty. Python doesn't distinguish
between "public" and "private" class members like C++ and Java do (but
see the discussion below). Any piece of
code can add, change, or delete any member of any object simply by
assigning to it. It is almost universally considered bad style,
though, since it means that people reading a class's definition can't
know what instances of that class actually consist of at runtime. It
also makes the code very fragile: if I tried to re-use
LinePreprocessor in some other application, it would
almost certainly fail.
I think it would be cleaner to add a reset method to
Preprocessor that does nothing:
class Preprocessor:
def reset(self, **kwargs):
pass
and then have each derived class override it as appropriate:
class LinePreprocessor(Preprocessor):
def reset(self, **kwargs):
self.stash = kwargs['stash']
Lines 1210--1218 would then become:
for preproc in [HTML_BLOCK_PREPROCESSOR,
LINE_PREPROCESSOR,
HTML_PATTERN,
ENTITY_PATTERN]:
preproc.reset(stash=self.htmlStash)
for preproc in [REFERENCE_PREPROCESSOR,
REFERENCE_PATTERN,
IMAGE_REFERENCE_PATTERN]:
preproc.reset(references=self.references)
RAWHTMLTEXTPOSTPROCESSOR.reset(stash=self.htmlStash, safeMode=self.safeMode)
This is a lot safer for future programmers to extend: just figure
out what kind of preprocessor you have, and stick the singleton in the
appropriate list. It would mean moving the definition of the
HtmlStash class higher in the code, but that doesn't look
like a problem, since it doesn't depend on anything else:
0941: class HtmlStash: 0942: """ 0943: This class is used for stashing HTML objects that we extract 0944: in the beginning and replace with place-holders. 0945: """ 0946: 0947: def __init__ (self): 0948: self.html_counter = 0 # for counting inline html segments 0949: self.rawHtmlBlocks=[] 0950: 0951: def store(self, html, safe=False): 0952: """ 0953: Saves an HTML segment for later reinsertion. Returns a 0954: placeholder string that needs to be inserted into the 0955: document. 0956: @param html: an html segment 0957: @param safe: label an html segment as safe for safemode 0958: @param inline: label a segmant as inline html 0959: @returns : a placeholder string 0960: """ 0961: self.rawHtmlBlocks.append((html, safe)) 0962: placeholder = HTML_PLACEHOLDER % self.html_counter 0963: self.html_counter += 1 0964: return placeholder
Going back to LinePreprocessor, its other member is a
utility method called _isLine. The leading underscore is as close as Python
gets to information hiding; see the docs for more information. But
look at this method's body:
0447: def _isLine(self, block):
0448: """Determines if a block should be replaced with an <:wHR>"""
0449: if block.startswith(" "): return 0 # a code block
0450: text = "".join([x for x in block if not x.isspace()])
0451: if len(text) <= 2:
0452: return 0
0453: for pattern in ['isline1', 'isline2', 'isline3']:
0454: m = RE.regExp[pattern].match(text)
0455: if (m and m.group(1)):
0456: return 1
0457: else:
0458: return 0
The indentation for code blocks is hardwired on line 449; I'm pretty sure that will show up somewhere else
in the code, so it ought to be lifted up to the constants section at
the top of the file. I don't know what the <=2 test
on lines 451--452 is for:
ignoring single and double dashes, maybe, so that only three or more
counts as a ruler? And what is the RE variable on line
454? A search forward finds this 500-odd lines
later:
1096: RE = CorePatterns()
and a search backward from there finds this:
1067: class CorePatterns:
1068: """
1069: This class is scheduled for removal as part of a refactoring
1070: effort.
1071: """
1072:
1073: patterns = {
1074: 'header': r'(#*)([^#]*)(#*)', # # A title
1075: 'reference-def': r'(\ ?\ ?\ ?)\[([^\]]*)\]:\s*([^ ]*)(.*)',
1076: # [Google]: http://www.google.com/
1077: 'containsline': r'([-]*)$|^([=]*)', # -----, =====, etc.
1078: 'ol': r'[ ]{0,3}[\d]*\.\s+(.*)', # 1. text
1079: 'ul': r'[ ]{0,3}[*+-]\s+(.*)', # "* text"
1080: 'isline1': r'(\**)', # ***
1081: 'isline2': r'(\-*)', # ---
1082: 'isline3': r'(\_*)', # ___
1083: 'tabbed': r'((\t)|( ))(.*)', # an indented line
1084: 'quoted': r'> ?(.*)', # a quoted block ("> ...")
1085: }
1086:
1087: def __init__ (self):
1088:
1089: self.regExp = {}
1090: for key in self.patterns.keys():
1091: self.regExp[key] = re.compile("^%s$" % self.patterns[key],
1092: re.DOTALL)
1093:
1094: self.regExp['containsline'] = re.compile(r'^([-]*)$|^([=]*)$', re.M)
1095:
1096: RE = CorePatterns()
I don't understand why the 'containsline' pattern is
defined (line 1077), and the 'quoted'
pattern duplicates the one on line 436 in
LinePreprocessor, but the docstring does say the
class is scheduled for refactoring. Anyway, lines 1080--1082 confirm my earlier
guess: lines less than two characters long are being weeded out
because a horizontal rule has to be at least three characters long.
But then why not rewrite the pattern so that it only matches three or
more of the character? And why not fold the three patterns into one,
so that _isLine doesn't need to loop? It doesn't matter
right now: I understand this code well enough to move on.
The next preprocessor handles HTML blocks. It's the biggest of the bunch, and I'm feeling a little punchy, so I put a question mark beside it and skip ahead to the reference preprocessor, which is shorter and (hopefully) simpler. The code is:
0561: class ReferencePreprocessor (Preprocessor):
0562:
0563: def run (self, lines):
0564:
0565: new_text = [];
0566: for line in lines:
0567: m = RE.regExp['reference-def'].match(line)
0568: if m:
0569: id = m.group(2).strip().lower()
0570: t = m.group(4).strip() # potential title
0571: if not t:
0572: self.references[id] = (m.group(3), t)
0573: elif (len(t) >= 2
0574: and (t[0] == t[-1] == "\""
0575: or t[0] == t[-1] == "\'"
0576: or (t[0] == "(" and t[-1] == ")") ) ):
0577: self.references[id] = (m.group(3), t[1:-1])
0578: else:
0579: new_text.append(line)
0580: else:
0581: new_text.append(line)
0582:
0583: return new_text #+ "\n"
0584:
0585: REFERENCE_PREPROCESSOR = ReferencePreprocessor()
There's no docstring, but line 567 does refer
to a regular expression called 'reference-def' in the
RE variable I found while figuring out the previous
preprocessor. Just to refresh your memory, its definition is:
The regexp itself is too complicated for me to figure out right away (three leading spaces, all optional---what's that about?), but the example below seems clear enough: text in square brackets, followed by a colon, followed by optional space, followed by anything that isn't a space character. Searching for "reference" in the Markdown syntax spec turns up this:
Reference-style links use a second set of square brackets, inside which you place a label of your choosing to identify the link: This is [an example][id] reference-style link. You can optionally use a space to separate the sets of brackets: This is [an example] [id] reference-style link. Then, anywhere in the document, you define your link label like this, on a line by itself: [id]: http://example.com/ "Optional Title Here"
It looks like Markdown-1 is being a little more general than this---lines 574--576 seem to allow the title to be in double quotes, single quotes, or parentheses---but I feel safe adding this class to my growing model of the software. The only question I scribble down is, "Why does this create a new list, instead of removing lines from the existing list?"
Before diving into the last of the preprocessors, it's worth reviewing how I've been reading this code. I had a rough idea of how it worked before I started, thanks to comments at the head of the file and my own understanding of how text filters are usually written. I've been filling in that model by reading blocks of code, chasing down references and going back to the Markdown syntax documentation whenever I hit something I can't figure out. I've been making lots of marks on a hard copy of the code (printed landscape so that I have plenty of room to scribble). I can't really call them "notes", since most of them are just tick marks, question marks, or circles around bits of code, but the more lines I mark, the faster I read.
So, on to the HtmlBlockPreprocessor. It starts with
four utility methods, none of which are documented or particularly
readable:
0463: class HtmlBlockPreprocessor (Preprocessor):
0464: """Removes html blocks from self.lines"""
0465:
0466: def _get_left_tag(self, block):
0467: return block[1:].replace(">", " ", 1).split()[0].lower()
0468:
0469:
0470: def _get_right_tag(self, left_tag, block):
0471: return block.rstrip()[-len(left_tag)-2:-1].lower()
0472:
0473: def _equal_tags(self, left_tag, right_tag):
0474:
0475: if left_tag in ['?', '?php', 'div']: # handle PHP, etc.
0476: return True
0477: if ("/" + left_tag) == right_tag:
0478: return True
0479: if (right_tag == "--" and left_tag == "--"):
0480: return True
0481: elif left_tag == right_tag[1:] \
0482: and right_tag[0] != "<":
0483: return True
0484: else:
0485: return False
0486:
0487: def _is_oneliner(self, tag):
0488: return (tag in ['hr', 'hr/'])
I'm going to trust that _get_left_tag and
_get_right_tag do what their names suggest---I'm not
happy with lengthy expressions like the one on line 467, but the program does run.
_equal_tags is more problematic: do we really not care
what the right tag is if the left tag is <?php> or
<div>? And _is_oneliner is only
checking for rulers, so presumably "one liner" means "one line across
the rendered page", not "an expression that only takes up one
line".
But onward: the run method starts like this:
0491: def run (self, text):
0492:
0493: new_blocks = []
0494: text = text.split("\n\n")
Whoa whoa whoa---what's with text.split? I thought
preprocessors were given a list of lines as input? A search for
HtmlBlockPreprocessor turns up exactly one use right
after the class's definition:
0558: HTML_BLOCK_PREPROCESSOR = HtmlBlockPreprocessor()
That variable is used in just one place, in the main
Markdown class:
1099: class Markdown:
...
1105: def __init__(self, source=None, # deprecated
1106: extensions=[],
1107: extension_configs=None,
1108: safe_mode = False):
...
1129: self.textPreprocessors = [HTML_BLOCK_PREPROCESSOR]
1130:
1131: self.preprocessors = [HEADER_PREPROCESSOR,
1132: LINE_PREPROCESSOR,
1133: # A footnote preprocessor will
1134: # get inserted here
1135: REFERENCE_PREPROCESSOR]
Ah ha---the documentation was misleading. Despite being derived
from Preprocessor, HtmlBlockPreprocessor
actually works on the text as a whole, rather than on a list of lines.
That makes the rest of run a little easier to---no, it's
still hard to figure out. The text input is being turned
into a list of lines by splitting on double newlines, which I presume
means paragraph breaks (but if two paragraphs are separated by a line
containing nothing but spaces and tabs, they'll be considered one
paragraph). The result is being assigned back to text
(line 494), which I think is bad style, since it
changes the type of the variable, invalidating the unwary reader's
assumptions about it in subsequent code.
Here's the rest of the method:
0496: items = []
0497: left_tag = ''
0498: right_tag = ''
0499: in_tag = False # flag
0500:
0501: for block in text:
0502: if block.startswith("\n"):
0503: block = block[1:]
0504:
0505: if not in_tag:
0506:
0507: if block.startswith("<"):
0508:
0509: left_tag = self._get_left_tag(block)
0510: right_tag = self._get_right_tag(left_tag, block)
0511:
0512: if not (is_block_level(left_tag) \
0513: or block[1] in ["!", "?", "@", "%"]):
0514: new_blocks.append(block)
0515: continue
0516:
0517: if self._is_oneliner(left_tag):
0518: new_blocks.append(block.strip())
0519: continue
0520:
0521: if block[1] == "!":
0522: # is a comment block
0523: left_tag = "--"
0524: right_tag = self._get_right_tag(left_tag, block)
0525: # keep checking conditions below and maybe just append
0526:
0527: if block.rstrip().endswith(">") \
0528: and self._equal_tags(left_tag, right_tag):
0529: new_blocks.append(
0530: self.stash.store(block.strip()))
0531: continue
0532: else: #if not block[1] == "!":
0533: # if is block level tag and is not complete
0534: items.append(block.strip())
0535: in_tag = True
0536: continue
0537:
0538: new_blocks.append(block)
0539:
0540: else:
0541: items.append(block.strip())
0542:
0543: right_tag = self._get_right_tag(left_tag, block)
0544:
0545: if self._equal_tags(left_tag, right_tag):
0546: # if find closing tag
0547: in_tag = False
0548: new_blocks.append(
0549: self.stash.store('\n\n'.join(items)))
0550: items = []
0551:
0552: if items:
0553: new_blocks.append(self.stash.store('\n\n'.join(items)))
0554: new_blocks.append('\n')
0555:
0556: return "\n\n".join(new_blocks)
Quick, what does the if on line 505 line up with? The answer is the
else on line 540, but 35 lines is a
long way to look for a match---I'd probably put the bodies of the
if and else in helper methods. I'd probably
also switch the order around: rather than "if not in tag...else...",
I'd use "if in tag...else...", since non-negated logic is
easier for people to read. I'd probably also strip all leading
newlines, rather than just one (lines 502--503), and---is the call to
_get_right_tag on line 510 guaranteed
to find a matching right tag, or just the next one? Either way, why
does the method re-set right_tag on line 524 if the opening tag is "<!" (i.e., a comment)?
Why use a list on line 513? Why not check if
block[1] is in a string (which would be faster)? How
does the use of "?" here interact with the test on line 475 of _equal_tags that I saw
earlier?
None of these questions are really important right now: as I've said before, the code passes its tests, and I'm not planning to rewrite it. But the temptation to refactor is often strong when you're going through someone else's code. No two programmers do things exactly the same way, and the more you understand the stuff you're reading, the stronger the temptation to "fix" it.
Resist. Life is short, code is long, and if you start down that path, you'll eventually find yourself writing your own programming language because you're not perfectly happy with anyone else's. I want a wiki processor for DrProject that I don't have to maintain, and whose syntax some users are likely to already know. I don't understand this code completely, but I now know enough to move on.
The next big chunk of code is titled "Inline Patterns", but I'm feeling tired again, so I skip ahead looking for something easier. The very next block is:
0871: """ 0872: ====================================================================== 0873: ========================== POST-PROCESSORS =========================== 0874: ====================================================================== 0875: 0876: Markdown also allows post-processors, which are similar to 0877: preprocessors in that they need to implement a "run" method. Unlike 0878: pre-processors, they take a NanoDom document as a parameter and work 0879: with that. 0880: 0881: Post-Processor should extend markdown.Postprocessor. 0882: 0883: There are currently no standard post-processors, but the footnote 0884: extension below uses one. 0885: """ 0886: 0887: class Postprocessor: 0888: pass
It doesn't get much simpler than that. Since it only took a few seconds to read the blurb, I look ahead to the next block:
0891: """ 0892: ====================================================================== 0893: ======================== TEXT-POST-PROCESSORS ======================== 0894: ====================================================================== 0895: 0896: Markdown also allows text-post-processors, which are similar to 0897: textpreprocessors in that they need to implement a "run" method. 0898: Unlike post-processors, they take a text string as a parameter and 0899: should return a string. 0900: 0901: Text-Post-Processors should extend markdown.Postprocessor. 0902: """
Oops: this smells like a violation of the Liskov
Substitution Principle. A postprocessor's run method
is supposed to take a Nanodom document as input; here, though, we have
a derived class whose override of that method takes a different kind
of object as input. It would be cleaner and safer to define two kinds
of postprocessors. But that aside, the only text postprocessor in the
module is pretty easy to understand (even without a docstring):
0905: class RawHtmlTextPostprocessor(Postprocessor):
0906:
0907: def __init__(self):
0908: pass
0909:
0910: def run(self, text):
0911: for i in range(self.stash.html_counter):
0912: html, safe = self.stash.rawHtmlBlocks[i]
0913: if self.safeMode and not safe:
0914: if str(self.safeMode).lower() == 'escape':
0915: html = self.escape(html)
0916: elif str(self.safeMode).lower() == 'remove':
0917: html = ''
0918: else:
0919: html = HTML_REMOVED_TEXT
0920:
0921: text = text.replace("<p>%s\n</p>" % (HTML_PLACEHOLDER % i),
0922: html + "\n")
0923: text = text.replace(HTML_PLACEHOLDER % i, html)
0924: return text
0925:
0926: def escape(self, html):
0927: ''' Basic html escaping '''
0928: html = html.replace('&', '&')
0929: html = html.replace('<', '<')
0930: html = html.replace('>', '>')
0931: return html.replace('"', '"')
0932:
0933: RAWHTMLTEXTPOSTPROCESSOR = RawHtmlTextPostprocessor()
I read through this class earlier;
it's just replacing HTML placeholders with the text that was
originally in the document. I'm a little worried about that
escape method, though; is this really the only place
where HTML gets escaped? A quick search doesn't turn up any other
suspicious calls to replace or escape (I
search by looking for the strings "replace(" and "escape(", since the
words themselves appear many other times in the code), so I make a
note and take a break.
I've spent a couple of hours reading this code so far, and have covered about a thousand lines. This is in keeping with the estimates and advice in the Smart Bear book, but I'm going to have to pick up the pace a bit if I want to finish this afternoon. The next chunk in the file is inline patterns; as the header comment explains:
0587: """ 0588: ====================================================================== 0589: ========================== INLINE PATTERNS =========================== 0590: ====================================================================== 0591: 0592: Inline patterns such as *emphasis* are handled by means of auxiliary 0593: objects, one per pattern. Pattern objects must be instances of classes 0594: that extend markdown.Pattern. Each pattern object uses a single regular 0595: expression and needs support the following methods: 0596: 0597: pattern.getCompiledRegExp() - returns a regular expression 0598: 0599: pattern.handleMatch(m, doc) - takes a match object and returns 0600: a NanoDom node (as a part of the provided 0601: doc) or None 0602: 0603: All of python markdown's built-in patterns subclass from Patter, 0604: but you can add additional patterns that don't. 0605: 0606: Also note that all the regular expressions used by inline must 0607: capture the whole block. For this reason, they all start with 0608: '^(.*)' and end with '(.*)!'. In case with built-in expression 0609: Pattern takes care of adding the "^(.*)" and "(.*)!". 0610: 0611: Finally, the order in which regular expressions are applied is very 0612: important - e.g. if we first replace http://.../ links with <a> tags 0613: and _then_ try to replace inline html, we would end up with a mess. 0614: So, we apply the expressions in the following order: 0615: 0616: * escape and backticks have to go before everything else, so 0617: that we can preempt any markdown patterns by escaping them. 0618: 0619: * then we handle auto-links (must be done before inline html) 0620: 0621: * then we handle inline HTML. At this point we will simply 0622: replace all inline HTML strings with a placeholder and add 0623: the actual HTML to a hash. 0624: 0625: * then inline images (must be done before links) 0626: 0627: * then bracketed links, first regular then reference-style 0628: 0629: * finally we apply strong and emphasis 0630: """
That seems simple enough. The next block is just a bunch of regular expressions with more-or-less comprehensible names and examples in comments---the examples are particularly welcome:
0632: NOBRACKET = r'[^\]\[]*'
0633: BRK = ( r'\[('
0634: + (NOBRACKET + r'(\[')*6
0635: + (NOBRACKET+ r'\])*')*6
0636: + NOBRACKET + r')\]' )
0637: NOIMG = r'(?<!\!)'
0638:
0639: BACKTICK_RE = r'\`([^\`]*)\`' # `e= m*c^2`
0640: DOUBLE_BACKTICK_RE = r'\`\`(.*)\`\`' # ``e=f("`")``
0641: ESCAPE_RE = r'\\(.)' # \<
0642: EMPHASIS_RE = r'\*([^\*]*)\*' # *emphasis*
0643: STRONG_RE = r'\*\*(.*)\*\*' # **strong**
0644: STRONG_EM_RE = r'\*\*\*([^_]*)\*\*\*' # ***strong***
0645:
0646: if SMART_EMPHASIS:
0647: EMPHASIS_2_RE = r'(?<!\S)_(\S[^_]*)_' # _emphasis_
0648: else:
0649: EMPHASIS_2_RE = r'_([^_]*)_' # _emphasis_
0650:
0651: STRONG_2_RE = r'__([^_]*)__' # __strong__
0652: STRONG_EM_2_RE = r'___([^_]*)___' # ___strong___
0653:
0654: LINK_RE = NOIMG + BRK + r'\s*\(([^\)]*)\)' # [text](url)
0655: LINK_ANGLED_RE = NOIMG + BRK + r'\s*\(<([^\)]*)>\)' # [text](<url>)
0656: IMAGE_LINK_RE = r'\!' + BRK + r'\s*\(([^\)]*)\)' # 
0657: REFERENCE_RE = NOIMG + BRK+ r'\s*\[([^\]]*)\]' # [Google][3]
0658: IMAGE_REFERENCE_RE = r'\!' + BRK + '\s*\[([^\]]*)\]' # ![alt text][2]
0659: NOT_STRONG_RE = r'( \* )' # stand-alone * or _
0660: AUTOLINK_RE = r'<(http://[^>]*)>' # <http://www.123.com>
0661: AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>' # <me@example.com>
0662: #HTML_RE = r'(\<[^\>]*\>)' # <...>
0663: HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)' # <...>
0664: ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' # &
0665: LINE_BREAK_RE = r' \n' # two spaces at end of line
0666: LINE_BREAK_2_RE = r' $' # two spaces at end of text
Next up is a class called Pattern compiles and stores
a regular expression:
0669: class Pattern:
0670:
0671: def __init__ (self, pattern):
0672: self.pattern = pattern
0673: self.compiled_re = re.compile("^(.*)%s(.*)$" % pattern, re.DOTALL)
0674:
0675: def getCompiledRegExp (self):
0676: return self.compiled_re
0677:
0678: BasePattern = Pattern # for backward compatibility
I'm not sure why the author bothered; a compiled regexp is a
perfectly good object, and its pattern member stores the
text used to create it, so maybe the point is to wrap the regexp in
(.*) (which matches zero or more of anything)? Reading
on, there are three derived class:
0681: class SimpleTextPattern (Pattern): 0682: 0683: def handleMatch(self, m, doc): 0684: return doc.createTextNode(m.group(2)) 0685: 0686: 0687: class SimpleTagPattern (Pattern): 0688: 0689: def __init__ (self, pattern, tag): 0690: Pattern.__init__(self, pattern) 0691: self.tag = tag 0692: 0693: def handleMatch(self, m, doc): 0694: el = doc.createElement(self.tag) 0695: el.appendChild(doc.createTextNode(m.group(2))) 0696: return el 0697: 0698: 0699: class SubstituteTagPattern (SimpleTagPattern): 0700: 0701: def handleMatch (self, m, doc): 0702: return doc.createElement(self.tag)
Docstrings would have been nice; as it is, I have to guess why there are three, and how each will be used. (As an aside, wouldn't it be great if we could put class diagrams in source code, so that we could see the relationships between classes like this at a glance? I've written about this before; I'm still waiting...)
Onward to a class called BacktickPattern:
0705: class BacktickPattern (Pattern):
0706:
0707: def __init__ (self, pattern):
0708: Pattern.__init__(self, pattern)
0709: self.tag = "code"
0710:
0711: def handleMatch(self, m, doc):
0712: el = doc.createElement(self.tag)
0713: text = m.group(2).strip()
0714: #text = text.replace("&", "&")
0715: el.appendChild(doc.createTextNode(text))
0716: return el
The fact that line 714 is commented out worries me, but everything else seems pretty clear. It would be easier to understand if the regexp was right here, so that I could read it while I'm reading the class that uses it. Instead, I have to search forward to find out that this class is actually used to wrap two patterns:
0845: BACKTICK_PATTERN = BacktickPattern(BACKTICK_RE) 0846: DOUBLE_BACKTICK_PATTERN = BacktickPattern(DOUBLE_BACKTICK_RE)
I then search backward to find their definitions (which I skimmed over, but didn't remember, less than two minutes ago):
0639: BACKTICK_RE = r'\`([^\`]*)\`' # `e= m*c^2`
0640: DOUBLE_BACKTICK_RE = r'\`\`(.*)\`\`' # ``e=f("`")``
and then forward again to find that the pattern objects are used to
build a lookup table in the constructor of the main
Markdown class:
1099: class Markdown:
...
1105: def __init__(self, source=None, # deprecated
1106: extensions=[],
1107: extension_configs=None,
1108: safe_mode = False):
...
1146: self.inlinePatterns = [DOUBLE_BACKTICK_PATTERN,
1147: BACKTICK_PATTERN,
A quick surf through the Markdown syntax documentation tells me that single backticks are used to format inline code blocks, while double backticks are used when the code block contains single backticks.
I think this code is ripe for refactoring. As a first step, it, and many of its siblings, could be put into a table like this:
stuff = (
(r'\`([^\`]*)\`', BacktickPattern),
(r'\`\`(.*)\`\`', BacktickPattern),
...
)
The Markdown class's inlinePatterns
member could then be built as:
self.inlinePatterns = []
for (pat, cls) in stuff:
self.inlinePatterns.append(cls(pat))
There are plenty of other ways to do it as well: for example, store the patterns to be used with each matching class in a class variable, so that they're right there for reading, then have each class self-evaluate to create its own instances. These aren't just tricks; instead, they take advantage of the fact that Python is a dynamic language, and that it's easy to store things like class objects in data structures.
But I will resist the temptation and press
on to the DoubleTagPattern class:
0719: class DoubleTagPattern (SimpleTagPattern):
0720:
0721: def handleMatch(self, m, doc):
0722: tag1, tag2 = self.tag.split(",")
0723: el1 = doc.createElement(tag1)
0724: el2 = doc.createElement(tag2)
0725: el1.appendChild(el2)
0726: el2.appendChild(doc.createTextNode(m.group(2)))
0727: return el1
Its use:
0852: STRONG_EM_PATTERN = DoubleTagPattern(STRONG_EM_RE, 'strong,em') 0853: STRONG_EM_PATTERN_2 = DoubleTagPattern(STRONG_EM_2_RE, 'strong,em')
makes its purpose clear: Markdown represents strong emphasized text
using a single marker, so this class catches that case and wraps the
text in two tags rather than one. How about HtmlPattern?
0730: class HtmlPattern (Pattern): 0731: 0732: def handleMatch (self, m, doc): 0733: rawhtml = m.group(2) 0734: inline = True 0735: place_holder = self.stash.store(rawhtml) 0736: return doc.createTextNode(place_holder)
The pattern used to instantiate its unique instance is:
0663: HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)' # <...>
and again, it's the comment that saves me---this pattern is used to match literal HTML embedded in text.
There are five more of these classes: LinkPattern,
ImagePattern, ReferencePattern (for matching
references to pre-defined URLs), ImageReferencePattern
(ditto, but for images), and AutolinkPattern (which
matches literal URLs embedded in angle brackets). Each is
instantiated one or more times to create matchers for each of the
unique things Markdown supports. Now that I have the general pattern
in my head, I don't bother to read them in detail---their names and
the comments on the regexp's used to instantiate them are enough.
Time for another quick break.
There---one more cup of tea ought to be enough to finish this off. There's a block of "miscellaneous auxiliary classes" next that starts with a "block guru" to find the boundaries of indented blocks:
0935: """
0936: ======================================================================
0937: ========================== MISC AUXILIARY CLASSES ====================
0938: ======================================================================
0939: """
0940:
0965:
0966:
0967: class BlockGuru:
0968:
0969: def _findHead(self, lines, fn, allowBlank=0):
0970: """
0971: Functional magic to help determine boundaries of indented
0972: blocks.
0973:
0974: @param lines: an array of strings
0975: @param fn: a function that returns a substring of a string
0976: if the string matches the necessary criteria
0977: @param allowBlank: specifies whether it's ok to have blank
0978: lines between matching functions
0979: @returns: a list of post processes items and the unused
0980: remainder of the original list"""
...
1027:
1028:
1029: def detabbed_fn(self, line):
1030: """ An auxiliary method to be passed to _findHead """
1031: m = RE.regExp['tabbed'].match(line)
1032: if m:
1033: return m.group(4)
1034: else:
1035: return None
1036:
1037:
1038: def detectTabbed(self, lines):
1039:
1040: return self._findHead(lines, self.detabbed_fn,
1041: allowBlank = 1)
I've omitted the body of _findHead because 42 lines of
loop-wrapped conditionals isn't something I want to read right now.
Having written line-oriented parsers myself, I have a pretty good idea
of how this one is going to work; it (almost certainly) works, and I
(almost certainly) will be able to understand it if I need to. The
last two utility functions are a breeze by comparison:
1044: def print_error(string):
1045: """Print an error string to stderr"""
1046: sys.stderr.write(string +'\n')
1047:
1048:
1049: def dequote(string):
1050: """ Removes quotes from around a string """
1051: if ( ( string.startswith('"') and string.endswith('"'))
1052: or (string.startswith("'") and string.endswith("'")) ):
1053: return string[1:-1]
1054: else:
1055: return string
Finally, Markdown itself, the heart of the whole
parser. I could have started with it, then worked outward through the
utilities and supporting classes to get a top-down view of how this
all hangs together. I could equally well have fired the parser up in
a debugger and stepped through it, skipping over constructor and
method calls the first time, then diving deeper and deeper on
subsequent passes until I had an execution-oriented view. They're all
good strategies, and play well together; no matter which one you use,
you don't want to hit a comment like this:
1057: """ 1058: ====================================================================== 1059: ========================== CORE MARKDOWN ============================= 1060: ====================================================================== 1061: 1062: This stuff is ugly, so if you are thinking of extending the syntax, 1063: see first if you can do it via pre-processors, post-processors, 1064: inline patterns or a combination of the three. 1065: """
Gulp. Markdown is a little over 700 lines long.
That's over a third of the total length of the file, all in one
class. Let's start with method headers:
1105: def __init__(self, source=None, ...): 1175: def registerExtensions(self, extensions, configs): 1199: def registerExtension(self, extension): 1203: def reset(self): 1223: def _transform(self): 1273: def _processSection(self, parent_elem, lines, 1348: def _processHeader(self, parent_elem, paragraph): 1360: def _processParagraph(self, parent_elem, paragraph, inList, looseList): 1379: def _processUList(self, parent_elem, lines, inList): 1383: def _processOList(self, parent_elem, lines, inList): 1388: def _processList(self, parent_elem, lines, inList, listexpr, tag): 1475: def _linesUntil(self, lines, condition): 1490: def _processQuote(self, parent_elem, lines, inList): 1528: def _processCodeBlock(self, parent_elem, lines, inList): 1553: def _handleInlineWrapper (self, line, patternIndex=0): 1579: def _handleInline(self, line): 1601: def _applyPattern(self, line, pattern, patternIndex): 1663: def convert (self, source = None): 1697: def __str__(self):
Right: there's a constructor and four "public methods". The
constructor sticks all the pattern-matching classes defined earlier
into a table, and sets up other configuration values (e.g., sets
self.registeredExtensions to an empty list and
self.stripTopLevelTags to 1). Two of the public methods
are for registering extensions---nothing to worry about there---and
then there's convert, which is the main driver for
converting a blob of Markdown to HTML:
1663: def convert (self, source = None): 1664: """ 1665: Return the document in XHTML format. 1666: @returns: A serialized XHTML body. 1667: """ 1668: 1669: if source is not None: # Allow blank string 1670: self.source = source 1671: 1672: if not self.source: 1673: return u"" 1674: 1675: try: 1676: self.source = unicode(self.source) 1677: except UnicodeDecodeError: 1678: message(CRITICAL, 'UnicodeDecodeError: Markdown only accepts unicode or ascii input.') 1679: return u"" 1680: 1681: for pp in self.textPreprocessors: 1682: self.source = pp.run(self.source) 1683: 1684: doc = self._transform() 1685: xml = doc.toxml() 1686: 1687: # Return everything but the top level tag 1688: if self.stripTopLevelTags: 1689: xml = xml.strip()[23:-7] + "\n" 1690: 1691: for pp in self.textPostprocessors: 1692: xml = pp.run(xml) 1693: 1694: return (self.docType + xml).strip()
Reading through convert gives me a pretty good idea of
how everything else fits together: the preprocessors are run,
_transform is called to turn the result into a tree,
that's turned back into text, um, there's some magic where the first
23 and last 7 characters are stripped off (what's that
about?), then the postprocessors are run and the result is returned.
_transform seems the next logical place to go:
1223: def _transform(self):
1224: """
1225: Transforms the Markdown text into a XHTML body document
1226: @returns: A NanoDom Document
1227: """
1228:
1229: # Setup the document
1230:
1231: self.doc = Document()
1232: self.top_element = self.doc.createElement("span")
1233: self.top_element.appendChild(self.doc.createTextNode('\n'))
1234: self.top_element.setAttribute('class', 'markdown')
1235: self.doc.appendChild(self.top_element)
1236:
1237: # Fixup the source text
1238: text = self.source
1239: text = text.replace("\r\n", "\n").replace("\r", "\n")
1240: text += "\n\n"
1241: text = text.expandtabs(TAB_LENGTH)
1242:
1243: # Split into lines and run the preprocessors that will work with
1244: # self.lines
1245: self.lines = text.split("\n")
1246:
1247: # Run the pre-processors on the lines
1248: for prep in self.preprocessors :
1249: self.lines = prep.run(self.lines)
1250:
1251: # Create a NanoDom tree from the lines and attach it to Document
1252: buffer = []
1253: for line in self.lines:
1254: if line.startswith("#"):
1255: self._processSection(self.top_element, buffer)
1256: buffer = [line]
1257: else:
1258: buffer.append(line)
1259: self._processSection(self.top_element, buffer)
1260:
1261: #self._processSection(self.top_element, self.lines)
1262:
1263: # Not sure why I put this in but let's leave it for now.
1264: self.top_element.appendChild(self.doc.createTextNode('\n'))
1265:
1266: # Run the post-processors
1267: for postprocessor in self.postprocessors:
1268: postprocessor.run(self.doc)
1269:
1270: return self.doc
Oops---my earlier comment was inaccurate.
Markdown.convert only runs the text
preprocessors before calling this method; the line-oriented
preprocessors are called here, on lines 1248--1249. (The same is true of the postprocessors.)
I don't understand why lines starting with '#' are
special (line 1254), or why the commented-out call
to _processSection on line 1261
wasn't just taken out (the call on line 1259 seems
to have taken its place), and even the author doesn't know why an
extra newline is being added on line 1264, but
never mind: the intent is clear.
So, what does _processSection do? The answer is, "a
lot". According to its docstring, its job is to look for high-level
structural elements like lists, block quotes, and so on, then strip
them of their markup and process their content recursively. This
method uses a table to drive calls to four others:
_processUList, _processOList,
_processQuote, and _processCodeBlock:
1273: def _processSection(self, parent_elem, lines,
1274: inList = 0, looseList = 0):
1275: """
1276: Process a section of a source document, looking for high level
1277: structural elements like lists, block quotes, code segments,
1278: html blocks, etc. Some those then get stripped of their high
1279: level markup (e.g. get unindented) and the lower-level markup
1280: is processed recursively.
1281:
1282: @param parent_elem: A NanoDom element to which the content
1283: will be added
1284: @param lines: a list of lines
1285: @param inList: a level
1286: @returns: None
1287: """
1288:
1289: # Loop through lines until none left.
1290: while lines:
1291:
1292: # Check if this section starts with a list, a blockquote or
1293: # a code block
1294:
1295: processFn = { 'ul': self._processUList,
1296: 'ol': self._processOList,
1297: 'quoted': self._processQuote,
1298: 'tabbed': self._processCodeBlock}
1299:
1300: for regexp in ['ul', 'ol', 'quoted', 'tabbed']:
1301: m = RE.regExp[regexp].match(lines[0])
1302: if m:
1303: processFn[regexp](parent_elem, lines, inList)
1304: return
1305:
1306: # We are NOT looking at one of the high-level structures like
1307: # lists or blockquotes. So, it's just a regular paragraph
1308: # (though perhaps nested inside a list or something else). If
1309: # we are NOT inside a list, we just need to look for a blank
1310: # line to find the end of the block. If we ARE inside a
1311: # list, however, we need to consider that a sublist does not
1312: # need to be separated by a blank line. Rather, the following
1313: # markup is legal:
1314: #
1315: # * The top level list item
1316: #
1317: # Another paragraph of the list. This is where we are now.
1318: # * Underneath we might have a sublist.
1319: #
1320:
1321: if inList:
1322:
1323: start, lines = self._linesUntil(lines, (lambda line:
1324: RE.regExp['ul'].match(line)
1325: or RE.regExp['ol'].match(line)
1326: or not line.strip()))
1327:
1328: self._processSection(parent_elem, start,
1329: inList - 1, looseList = looseList)
1330: inList = inList-1
1331:
1332: else: # Ok, so it's just a simple block
1333:
1334: paragraph, lines = self._linesUntil(lines, lambda line:
1335: not line.strip())
1336:
1337: if len(paragraph) and paragraph[0].startswith('#'):
1338: self._processHeader(parent_elem, paragraph)
1339:
1340: elif paragraph:
1341: self._processParagraph(parent_elem, paragraph,
1342: inList, looseList)
1343:
1344: if lines and not lines[0].strip():
1345: lines = lines[1:] # skip the first (blank) line
I would make some changes if I was refactoring:
processFn table to drive the
loop on line 1300 rather than repeating them (and
convert processFn to a list of pairs if the order is
important).while loop starting on line 1290: it's not immediately clear (a) that every pass
through the loop is expected to cut some content out of
lines, or (b) that the early return on line 1304 is safe.inList depth variable
in one spot (OK, this is minor, but it's a habit I picked up the hard
way).There's another 320 lines of code in this class, but I'm going to
skip over it: the "new understanding per line" ratio is dropping off
pretty quickly. Some of this is fatigue, but mostly it's because I
now understand the application well enough that almost everything I
see just confirms things that are already in my mental model. That's
not the same as saying that I already know them; rather, it means that
nothing I see contradicts any of the ideas I've built up about the
code's organization, structure, or operation. If I was
planning to refactor this, I'd read the Markdown class in
detail, since it would be the biggest candidate for reorganization,
but I'm not, so I won't.
The home stretch is the driver functions near the bottom of the
file that process files and their contents.
markdownFromFile is:
1711: def markdownFromFile(input = None, 1712: output = None, 1713: extensions = [], 1714: encoding = None, 1715: message_threshold = CRITICAL, 1716: safe = False): 1717: 1718: global console_hndlr 1719: console_hndlr.setLevel(message_threshold) 1720: 1721: message(DEBUG, "input file: %s" % input) 1722: 1723: if not encoding: 1724: encoding = "utf-8" 1725: 1726: input_file = codecs.open(input, mode="r", encoding=encoding) 1727: text = input_file.read() 1728: input_file.close() 1729: 1730: text = removeBOM(text, encoding) 1731: 1732: new_text = markdown(text, extensions, safe_mode = safe) 1733: 1734: if output: 1735: output_file = codecs.open(output, "w", encoding=encoding) 1736: output_file.write(new_text) 1737: output_file.close() 1738: 1739: else: 1740: sys.stdout.write(new_text.encode(encoding))
No surprises there, and none in the markdown function
either:
1742: def markdown(text,
1743: extensions = [],
1744: safe_mode = False):
1745:
1746: message(DEBUG, "in markdown.markdown(), received text:\n%s" % text)
1747:
1748: extension_names = []
1749: extension_configs = {}
1750:
1751: for ext in extensions:
1752: pos = ext.find("(")
1753: if pos == -1:
1754: extension_names.append(ext)
1755: else:
1756: name = ext[:pos]
1757: extension_names.append(name)
1758: pairs = [x.split("=") for x in ext[pos+1:-1].split(",")]
1759: configs = [(x.strip(), y.strip()) for (x, y) in pairs]
1760: extension_configs[name] = configs
1761:
1762: md = Markdown(extensions=extension_names,
1763: extension_configs=extension_configs,
1764: safe_mode = safe_mode)
1765:
1766: return md.convert(text)
Finally, there's a class called Extension, which
presumably is supposed to be used as the base class for extensions (a
docstring to this effect, and some usage notes, would be useful):
1769: class Extension:
1770:
1771: def __init__(self, configs = {}):
1772: self.config = configs
1773:
1774: def getConfig(self, key):
1775: if self.config.has_key(key):
1776: return self.config[key][0]
1777: else:
1778: return ""
1779:
1780: def getConfigInfo(self):
1781: return [(key, self.config[key][1]) for key in self.config.keys()]
1782:
1783: def setConfig(self, key, value):
1784: self.config[key][0] = value
And that's it; after three hours and twenty minutes, I'm done.
So, was it worth it? I was already 99% sure that the code worked, and as I've said at least four times, I'm not planning to refactor it, so were 200 minutes of close reading a good investment? I think so: it'll probably be easier now for me to write the extensions I'll need to write to handle DrProject-specific links, and debugging anything that blows up will definitely be easier. More importantly, I'm now much more confident that the code will carry the weight we plan to put on it: it's organized the way I'd expect it to be, it uses idioms I'm familiar with, and generally shows plenty of signs of good artisanship.
Now all I have to do is integrate it...
Copyright (c) 2008 Greg Wilson