Reviewing Markdown

One of the things I learned from our January code sprint is that DrProject's existing wiki parser has reached, or passed, the point of no return. What started off in Trac as an almost-comprehensible list of regular expressions has grown into a tangle of special cases so sensitive that any change anywhere breaks something somewhere else.

My first thought was to write a new parser for the same syntax using ANTLR. However, my minions students have convinced me that switching to some other syntax---something that's already widely used---would be a better idea. After a bit of browsing, we settled on Markdown, for which there are two working parsers for it in Python: Markdown-1, written by Manfred Stienstra, Yuri Takhteyev, and Waylan Limberg, and Markdown-2, written by ActiveState's Trent Mick.

So how should I go about choosing which one to use? The first step is to list their strengths. Markdown-1 was written from scratch in Python; Markdown-2 is a fairly direct translation into Python of the Perl Markdown parser. (For example, some of the comments still use Perl notation for regular expressions.) Markdown-1 has been written by three people (lots of eyeballs are good); Markdown-2 by one (a single style is also good). They're similar in length (roughly 1900 and 1680 lines respectively); Markdown-2 claims to be faster and better tested, but no figures are provided on its web site.

Step two: contact the authors. Are both projects still alive, or has either been abandoned? What are their plans for future development? Would they be interested in helping us integrate their code into DrProject? And what do they think of the other project?

Answers came back pretty quickly, and were encouraging: both projects are very much alive. That left step three: read the code. I'd like to believe that we'll be able to treat whichever one we choose as a black box, but past experience tells me that sooner or later (probably sooner), we'll have to dive in and either fix bugs or rearrange things to make it possible to add some extension we absolutely, positively need. I actually did this reading as I was talking via email with the two libraries' authors; they cleared up some questions I had, and Trent even went so far as to make and check in a change that I suggested.

After spending about an hour and a half with each library over the course of a couple of days, I decided to go with Markdown-1. Markdown-2 performance is definitely faster:

$ python perf.py -r 10 tmp-test-cases
Time conversion of tmp-test-cases/*.text (plat=darwin):
  Markdown.pl: best of 10: 0.372s
  markdown.py: best of 10: 0.295s
  markdown2.py: best of 10: 0.221s
$ python perf.py tmp-aspn-cases
Time conversion of tmp-aspn-cases/*.text (plat=darwin):
  Markdown.pl: best of 3: 5.842s
  markdown.py: best of 3: 18.920s
  markdown2.py: best of 3: 4.956s

but (a) our pages tend to be fairly short, and (b) I'm confident Markdown-1 can be tuned. What made me decide in its favor was that it goes from wiki text to DOM to HTML, while Markdown-2 goes straight from wiki text to HTML. This will probably make any structural extensions we want to add simpler to write---while we could always translate to HTML, then convert that to DOM, mess with the structure, and convert back to HTML, hooking into a text->DOM->HTML pipeline seems cleaner. I also found Markdown-1 easier to understand, which is undoubtedly more of a reflection on Markdown-2's Perl antecedents than it is on Trent's programming style ;-).

Now it's time for step 4: re-read the code in detail, and make sure that I understand it well enough to write the extensions we need (for example, one to turn the string "#123" into a link to a ticket). This is also a chance for me to check that I made the right decision: if anything particularly ugly turns up on closer reading, I can still go back to Markdown-2. Finally, I started to talk about code reviews to my undergraduate software engineering class this week; taking notes as I go through Markdown-1 will give me a worked example to put in front of them.

After a little reformatting, the source is 1858 lines long. I didn't read it from start to finish; instead, I jumped around, trying to build a mental model of how the pieces fit together. What follows is a reconstruction of that exploration; I'd welcome comments about things I may have missed, or how else you might have read the code.

Getting Started

The first thing I read is the header. The docstring tells me how to run the program as a standalone application, and where to look for more information:

0001: #!/usr/bin/env python
0002: 
0003: version = "1.7"
0004: version_info = (1,7,0,"rc-1")
0005: __revision__ = "$Rev: 66 $"
0006: 
0007: """
0008: Python-Markdown
0009: ===============
0010: 
0011: Converts Markdown to HTML.  Basic usage as a module:
0012: 
0013:     import markdown
0014:     md = Markdown()
0015:     html = md.convert(your_text_string)
0016: 
0017: See http://www.freewisdom.org/projects/python-markdown/ for more
0018: information and instructions on how to extend the functionality of the
0019: script.  (You might want to read that before you try modifying this
0020: file.)
0021: 
0022: Started by [Manfred Stienstra](http://www.dwerg.net/).  Continued and
0023: maintained  by [Yuri Takhteyev](http://www.freewisdom.org) and [Waylan
0024: Limberg](http://achinghead.com/).
0025: 
0026: Contact: yuri [at] freewisdom.org
0027:          waylan [at] gmail.com
0028: 
0029: License: GPL 2 (http://www.gnu.org/copyleft/gpl.html) or BSD
0030: 
0031: """

From there, I jump to the very end of the file:

1853: if __name__ == '__main__':
1854:     """ Run Markdown from the command line. """
1855:     options = parse_options()
1856:     if not options:
1857:         sys.exit(0)
1858:     markdownFromFile(**options)

This is a standard Python idiom: if your __name__ variable has been set to '__main__', you're the main program, so parse any command-line arguments you've been given and then execute them. Execution is handled by markdownFromFile; reading that will probably take me into the actual parsing, so I'll put it off for now. Options are handled by parse_options, which looks like this:

1787: OPTPARSE_WARNING = """
1788: Python 2.3 or higher required for advanced command line options.
1789: For lower versions of Python use:
1790: 
1791:       %s INPUT_FILE > OUTPUT_FILE
1792: 
1793: """ % EXECUTABLE_NAME_FOR_USAGE
1794: 
1795: def parse_options():
1796: 
1797:     try:
1798:         optparse = __import__("optparse")
1799:     except:
1800:         if len(sys.argv) == 2:
1801:             return {'input': sys.argv[1],
1802:                     'output': None,
1803:                     'message_threshold': CRITICAL,
1804:                     'safe': False,
1805:                     'extensions': [],
1806:                     'encoding': None }
1807: 
1808:         else:
1809:             print OPTPARSE_WARNING
1810:             return None
1811: 
1812:     parser = optparse.OptionParser(usage="%prog INPUTFILE [options]")
1813: 
1814:     parser.add_option("-f", "--file", dest="filename",
1815:                       help="write output to OUTPUT_FILE",
1816:                       metavar="OUTPUT_FILE")
1817:     parser.add_option("-e", "--encoding", dest="encoding",
1818:                       help="encoding for input and output files",)
1819:     parser.add_option("-q", "--quiet", default = CRITICAL,
1820:                       action="store_const", const=60, dest="verbose",
1821:                       help="suppress all messages")
1822:     parser.add_option("-v", "--verbose",
1823:                       action="store_const", const=INFO, dest="verbose",
1824:                       help="print info messages")
1825:     parser.add_option("-s", "--safe", dest="safe", default=False,
1826:                       metavar="SAFE_MODE",
1827:                       help="safe mode ('replace', 'remove' or 'escape'  user's HTML tag)")
1828: 
1829:     parser.add_option("--noisy",
1830:                       action="store_const", const=DEBUG, dest="verbose",
1831:                       help="print debug messages")
1832:     parser.add_option("-x", "--extension", action="append", dest="extensions",
1833:                       help = "load extension EXTENSION", metavar="EXTENSION")
1834: 
1835:     (options, args) = parser.parse_args()
1836: 
1837:     if not len(args) == 1:
1838:         parser.print_help()
1839:         return None
1840:     else:
1841:         input_file = args[0]
1842: 
1843:     if not options.extensions:
1844:         options.extensions = []
1845: 
1846:     return {'input': input_file,
1847:             'output': options.filename,
1848:             'message_threshold': options.verbose,
1849:             'safe': options.safe,
1850:             'extensions': options.extensions,
1851:             'encoding': options.encoding }

The first thing is to find EXECUTABLE_NAME_FOR_USAGE, which is back up at the top of the file:

0088: # The following constant specifies the name used in the usage
0089: # statement displayed for python versions lower than 2.3.  (With
0090: # python2.3 and higher the usage statement is generated by optparse
0091: # and uses the actual name of the executable called.)
0092: 
0093: EXECUTABLE_NAME_FOR_USAGE = "python markdown.py"

That's clear enough, and it tells me that the authors have taken some time to think about compatibility with older versions of Python. What about that call to __import__ on line 1798? It seems that Markdown-1 only imports the option parsing library when it needs to, but why not just use an import statement? Why use the functional form? Hm... Better ask the authors...

(Note: this bit of code highlights one of the big differences between C++ and Java on the one hand, and Python and Ruby on the other. In Java, import is a declaration about the program: it tells the compiler "I'm going to link against the following." In Python, it's a command, meaning, "Load and compile the following file right now." You can actually import code dynamically in Java, and even in some implementations of C++ (if the author of the code being imported has laid the groundwork), but it's a lot harder. On the other hand, by making promises to the compiler, you're giving it a chance to do optimizations that so far aren't possible for more dynamic languages.)


Next, I take a closer look at what the options parsing function returns:

1846:     return {'input': input_file,
1847:             'output': options.filename,
1848:             'message_threshold': options.verbose,
1849:             'safe': options.safe,
1850:             'extensions': options.extensions,
1851:             'encoding': options.encoding }

This is another common idiom: since the options can all be represented as key/value pairs, the function just stuffs them into a dictionary. The first two are self-explanatory; the third is... hang on... yes, it's used in markdownFromFile to specify how verbose error reporting and logging should be. No problem: that's just Python's standard logging library. Oh, but when I search in the file for message_threshold, I find this near the top of the file:

0039: MESSAGE_THRESHOLD = CRITICAL
0040: 
0041: # Configure debug message logger (the hard way - to support python 2.3)
0042: logger = getLogger('MARKDOWN')
0043: logger.setLevel(DEBUG) # This is restricted by handlers later
0044: console_hndlr = StreamHandler()
0045: formatter = Formatter('%(name)s-%(levelname)s: "%(message)s"')
0046: console_hndlr.setFormatter(formatter)
0047: console_hndlr.setLevel(MESSAGE_THRESHOLD)
0048: logger.addHandler(console_hndlr)
0049: 
0050: def message(level, text):
0051:     ''' A wrapper method for logging debug messages. '''
0052:     logger.log(level, text)

Again, nothing surprising: there's even a comment on line 41 telling me that logging is being initialized "the hard way" for backward compatibility. (I don't know what the easy way would be, but I don't need to care right now.)

The other three values in the options dictionary are called "safe", "extensions", and "encoding". Since this is a parser, the latter almost certainly means "character encoding"; I've read Joel Spolsky's excellent article about Unicode and character sets, so I don't expect to have any trouble figuring out how that's used. (I actually do, but we'll get to that later.) What about "safe"? Looking back in the body of the function, I see:

1825:     parser.add_option("-s", "--safe", dest="safe", default=False,
1826:                       metavar="SAFE_MODE",
1827:                       help="safe mode ('replace', 'remove' or 'escape'  user's HTML tag)")

"Replace, remove, or escape user's HTML tag" is a good hint. The world is full of evil people who enjoy putting link spam in wiki pages; apparently, Markdown-1 gives me the option of taking any raw HTML in the page being processed and either replace it with something (I need to figure out what), remove it entirely, or escape it (i.e., turn every "<" into "&lt;" and so on). Ideally, this choice will be implemented in a single switch statement (in Python, an if/elif/elif/oops block), or via some moral equivalent such as function references or polymorphic methods.

(Note: I say "if/elif/elif/oops" because variant paths should almost always be implemented as:

if case A:
    do A
elif case B:
    do B
elif case C:
    do C
else:
    report a fatal error and halt

because both:

if case A:
    do A
elif case B:
    do B
else:
    do C

and:

if case A:
    do A
elif case B:
    do B
elif case C:
    do C

do the wrong thing if a case D is ever added. If there are three choices, the odds are good there will one day be a fourth; plan for it.)

The last option I have to figure out is "extensions", which is set up like this:

1832:     parser.add_option("-x", "--extension", action="append", dest="extensions",
1833:                       help = "load extension EXTENSION", metavar="EXTENSION")

That isn't much help: I know that Markdown-1 allows user-written extensions, so all this tells me is that I have to use "-x" to load one---presumably from a file. I'll have to search further to figure out what "loading" means, what format extensions have to be in, and so on.


Back to the top of the file again. I've ticked off the first few lines (literally: it's the only way to keep track of what I still haven't read), and the next four are:

0034: import re, sys, os, random, codecs
0035: 
0036: from logging import getLogger, StreamHandler, Formatter, \
0037:                     DEBUG, INFO, WARN, ERROR, CRITICAL

Regular expressions, the system library, the OS library for file manipulation... Um, the random number generator? Really? What's that for? And then codecs, to handle different character encodings, and then the logging library for debugging messages, warnings, and so on. But why would a parser need random numbers?

Turns out it doesn't: a search through the file for the word "random" turns up no other hits. What happens if I take it out and run the tests? That wouldn't guarantee that it isn't being used, but it would be strong supporting evidence.

So how do I run the tests? Ah---there's a file called test-markdown.py in the root directory of the release. When I run it, it---oh. It gives me this:

bash-3.2$ python test-markdown.py
markdown
Traceback (most recent call last):
  File "test-markdown.py", line 351, in 
    testDirectory("tests/markdown-test", measure_time=True)
  File "test-markdown.py", line 215, in testDirectory
    htmlDiffFile = codecs.open(htmlDiffFilePath, "w", encoding=encoding)
  File "c:\Python25\lib\codecs.py", line 791, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: '/tmp/markdown-test.html'

Sure enough, it has /tmp hardwired as temporary directory. I'm running on Windows with Cygwin; I'll just change that temporarily to ./tmp, make the required directory, and---yup, no problem, now it runs.

OK, back to that random number module. I take it out of the import list and re-run the tests---everything's happy. But finding one unused library makes me suspicious, so I search for the others, and sure enough, it turns out that the os module isn't being used either. I take them both out, and make a note to send a patch to the library's authors. There's still 1800 lines of code to read, though...

More Landmarks

The next few lines in the module are:

0054: # --------------- CONSTANTS YOU MIGHT WANT TO MODIFY -----------------
0055: 
0056: TAB_LENGTH = 4            # expand tabs to this many spaces
0057: ENABLE_ATTRIBUTES = True  # @id = xyz -> <... id="xyz">
0058: SMART_EMPHASIS = 1        # this_or_that does not become this<i>or</i>that
0059: HTML_REMOVED_TEXT = "[HTML_REMOVED]" # text used instead of HTML in safe mode

Cool---symbolic constants. TAB_LENGTH explains itself; ENABLE_ATTRIBUTES is, um, I'll come back to that. SMART_EMPHASIS makes sense: underscores can be used to show emphasis in Markdown, and this controls whether it takes effect when it's embedded in other text. That just leaves HTML_REMOVED_TEXT, which must connect back to the "remove" option for raw HTML. I can check by searching :

0905: class RawHtmlTextPostprocessor(Postprocessor):
          ...
0910:     def run(self, text):
0911:         for i in range(self.stash.html_counter):
0912:             html, safe  = self.stash.rawHtmlBlocks[i]
0913:             if self.safeMode and not safe:
0914:                 if str(self.safeMode).lower() == 'escape':
0915:                     html = self.escape(html)
0916:                 elif str(self.safeMode).lower() == 'remove':
0917:                     html = ''
0918:                 else:
0919:                     html = HTML_REMOVED_TEXT

Uh oh: this is exactly the fall-through that I warned about. What it's saying is, "If the safe mode isn't escape or remove, it must be replace," but if we ever add a fourth escape mode, that won't be true. It would be paranoid better to add another elif, and have the else complain and exit. And while I'm here: why is the safe mode being converted to lower case before being checked? Does that mean there's some path through the code that could make it upper case? Make... another... note...

Back to ENABLE_ATTRIBUTES. It is used in the toxml method of a class called Element:

0286:     def toxml(self):
0287:         if ENABLE_ATTRIBUTES:
0288:             for child in self.childNodes:
0289:                 child.handleAttributes()

and in the handleMatch method of a class called ImagePattern:

0760:     def handleMatch(self, m, doc):
0761:         el = doc.createElement('img')
0762:         src_parts = m.group(9).split()
0763:         if src_parts:
0764:             el.setAttribute('src', src_parts[0])
0765:         else:
0766:             el.setAttribute('src', "")
0767:         if len(src_parts) > 1:
0768:             el.setAttribute('title', dequote(" ".join(src_parts[1:])))
0769:         if ENABLE_ATTRIBUTES:
0770:             text = doc.createTextNode(m.group(2))
0771:             el.appendChild(text)
0772:             text.handleAttributes()
0773:             truealt = text.value
0774:             el.childNodes.remove(text)
0775:         else:
0776:             truealt = m.group(2)
0777:         el.setAttribute('alt', truealt)
0778:         return el

I'll ignore the second use for now, by which I mean, I'll circle the use of ENABLE_ATTRIBUTES, put a question mark beside it, and worry about it when I get there. Searching for the method associated with the first use---handleAttributes---doesn't turn up anything particularly interesting; there are several definitions of the method, all but one of which do nothing but pass. The one that doesn't just calls a callback, and... OK, I'll skip over this for now as well. Maybe it'll become obvious once I know more about the code. And maybe then it'll be clear why ENABLE_ATTRIBUTES is set to 1 (implying an enumeration), even though it appears to be a Boolean flag. It can't be backward compatibility: SMART_EMPHASIS, on the line below, is set to True.


The constants are followed by this rather cryptic block:

0061: RTL_BIDI_RANGES = ( (u'\u0590', u'\u07FF'),
0062:                     # from Hebrew to Nko (includes Arabic, Syriac and Thaana)
0063:                     (u'\u2D30', u'\u2D7F'),
0064:                     # Tifinagh
0065:                     )
0066: 
0067: # Unicode Reference Table:
0068: # 0590-05FF - Hebrew
0069: # 0600-06FF - Arabic
0070: # 0700-074F - Syriac
0071: # 0750-077F - Arabic Supplement
0072: # 0780-07BF - Thaana
0073: # 07C0-07FF - Nko
0074: 
0075: BOMS = { 'utf-8': (codecs.BOM_UTF8, ),
0076:          'utf-16': (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
0077:          #'utf-32': (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
0078:          }
0079: 
0080: def removeBOM(text, encoding):
0081:     convert = isinstance(text, unicode)
0082:     for bom in BOMS[encoding]:
0083:         bom = convert and bom.decode(encoding) or bom
0084:         if text.startswith(bom):
0085:             return text.lstrip(bom)
0086:     return text

What's "RTL"? And "BIDI"? The clue is in the names of the character sets: Hebrew and Arabic are both read right-to-left, or RTL. So does "BIDI" mean "bidirectional"? Searching forward finds this:

0134: def getBidiType(text):
0135: 
0136:     if not text: return None
0137: 
0138:     ch = text[0]
0139: 
0140:     if not isinstance(ch, unicode) or not ch.isalpha():
0141:         return None
0142: 
0143:     else:
0144: 
0145:         for min, max in RTL_BIDI_RANGES:
0146:             if ( ch >= min and ch <= max ):
0147:                 return "rtl"
0148:         else:
0149:             return "ltr"

Interesting: given a piece of text, this returns None if there isn't enough information to determine its reading order, or one of the two strings "rtl" and "ltr" if there is. This means the caller has to handle two data types (none and string) coming back from a single function, which probably means that there's a default setting which is only overridden if this function can actually determine an answer. But that's just a guess; I'll put a question mark on it and come back later.

Now, what about "BOM"? Google tells me that it stands for "bill of materials" or "byte order mark". The second sounds more likely: it's the two-byte Unicode character U+FEFF, more properly called "ZERO WIDTH NO-BREAK SPACE", which can be put at the start of a file (or any other text stream) to help readers determine whether the data is big-endian or little-endian. The Wikipedia article has the details; suffice to say for now that the only place the BOMS dictionary is used is in removeBOM, which is only ever called at the start of markdownFromFile. I'm therefore going to guess that all this is doing is getting rid of the first two characters of the input data if need be.

But: why is line 77 commented out? And what exactly is line 83 doing? I'll have to mail the authors to get an answer to the first; as for the second, the syntax "X and Y or Z" is an old-fashioned way of writing a one-line conditional in Python. If X is true (i.e., not zero, the empty string, or some equivalent value), the expression produces Y; if X is false, it produces Z. Unless Y is false as well, of course, in which case you also get Z. The modern way to write this is "Y if X else Z", but again, the authors want to be compatible with older versions of Python. Anyway, I assume this code has been tested, so the expression must work, so I won't worry about it for now.


Next, another little block:

0095: # --------------- CONSTANTS YOU _SHOULD NOT_ HAVE TO CHANGE ----------
0096: 
0097: # a template for html placeholders
0098: HTML_PLACEHOLDER_PREFIX = "qaodmasdkwaspemas"
0099: HTML_PLACEHOLDER = HTML_PLACEHOLDER_PREFIX + "%dajkqlsmdqpakldnzsdfls"
0100: 
0101: BLOCK_LEVEL_ELEMENTS = ['p', 'div', 'blockquote', 'pre', 'table',
0102:                         'dl', 'ol', 'ul', 'script', 'noscript',
0103:                         'form', 'fieldset', 'iframe', 'math', 'ins',
0104:                         'del', 'hr', 'hr/', 'style']
0105: 
0106: def is_block_level (tag):
0107:     return ( (tag in BLOCK_LEVEL_ELEMENTS) or
0108:              (tag[0] == 'h' and tag[1] in "0123456789") )

I can guess what lines 98--99 are for: whenever something needs to be taken out of the text temporarily, a garbage string that's very (very) unlikely to ever appear in a wiki page is inserted. Each garbage string has a unique counter embedded in it (notice the "%d" format at the start of the second half of the placeholder string?); that suggests that whenever text is taken out, it's stored in a dictionary with the placeholder (or possibly integer) as a key.

The other Markdown library did something similar, but in a safer way. Instead of choosing a fixed string, it constructed an MD5 hash of the text being taken out. I don't know what the odds of "qaodmasdkwaspemas" or "ajkqlsmdqpakldnzsdfls" appearing in a wiki page are, but I do know that MD5 is cryptographically very strong, which means the odds are infinitesimal compared with things like the number of atoms in the universe or the number of yammering idiots with cellphones who seem to think that everyone on the streetcar wants to know exactly how drunk they got at Adi's party on Friday.

Anyway: BLOCK_LEVEL_ELEMENTS pretty much explains itself. I would use a set rather than a list, particularly if tags are going to be looked up frequently, and I would insert the various header tags (h1, h2, and so on) into the structure to make is_block_level's logic simpler, and oh yeah, call it isBlockLevel to be consistent with the way other functions and methods in this code are named, but the important thing for now is that I understand what it's doing.

Nanodom

The next block in the file is 266 lines long, or 15% of the total. My heart sank when I read the title block:

0110: """
0111: ======================================================================
0112: ========================== NANODOM ===================================
0113: ======================================================================
0114: 
0115: The three classes below implement some of the most basic DOM
0116: methods.  I use this instead of minidom because I need a simpler
0117: functionality and do not want to require additional libraries.
0118: 
0119: Importantly, NanoDom does not do normalization, which is what we
0120: want. It also adds extra white space when converting DOM to string
0121: """

Another bunch of classes for representing XML as a tree in Python? xml.dom.minidom has been in the library for years, and ElementTree was added in Python 2.5. Do we really need another? And why is the header block formatted like a docstring? Don't those only work at the start of files, classes, and methods?

To answer the last question first, yes, but you create a string constant anywhere you like, and a lot of people now seem to use triple-quoted string blocks instead of block comments. As for the other questions, I mailed Yuri:


> Why your own little implementation of DOM?  There's a comment about
> text normalization --- can you please tell me what the problem was?

Mostly to remove the dependence on dom, but also because automatically
normalizes entities which was getting to be annoying.  I.e., markdown
allows you to include entities, e.g., &copy;.  If you put them into dom,
you will later get &amp;copy;  It's possible to work around it, but it was
ending up ugly.

That makes sense. It doesn't necessarily convince me---I'd need to work through the unescaping that would be required if minidom or ElementTree were used---but I'll leave it for now. On the bright side, it's 266 lines of code I don't need to read...

...which of course turns out not to be true. There are just enough differences between Nanodom and [name of your favorite XML library goes here] that I do actually have to read the code. I won't include all of the source here, but here are the highlights.

1. two tables of regular expressions called ENTITY_NORMALIZATION_EXPRESSIONS and ENTITY_NORMALIZATION_EXPRESSIONS_SOFT. The only difference is the way ampersands are matched: the first uses "&amp;", while the second uses "&amp;(?!\#)". The sub-expression "(?!\#)" means "match only if not followed by a literal # sign". The two tables are used in the Document.normalizeEntities method:

0152: class Document:
          ...
0192:     def normalizeEntities(self, text, avoidDoubleNormalizing=False):
0193: 
0194:         if avoidDoubleNormalizing:
0195:             regexps = ENTITY_NORMALIZATION_EXPRESSIONS_SOFT
0196:         else:
0197:             regexps = ENTITY_NORMALIZATION_EXPRESSIONS
0198: 
0199:         for regexp, substitution in regexps:
0200:             text = regexp.sub(substitution, text)
0201:         return text

At this point, I have no idea what the difference is; it's another question to send to the module's authors.

2. Document.__init__ sets self.bidi to "ltr". Its setBidi method is then:

0163:     def setBidi(self, bidi):
0164:         if bidi:
0165:             self.bidi = bidi

which matches my earlier guess about how bidirectional flags are being handled.

3. Lines 179--182 create a cache of entity references:

0179:     def createEntityReference(self, entity):
0180:         if entity not in self.entities:
0181:             self.entities[entity] = EntityReference(entity)
0182:         return self.entities[entity]

The only place this is used is in AutomailPattern.handleMatch:

0826: class AutomailPattern (Pattern):
0827: 
0828:     def handleMatch(self, m, doc):
0829:         el = doc.createElement('a')
0830:         email = m.group(2)
0831:         if email.startswith("mailto:"):
0832:             email = email[len("mailto:"):]
0833:         for letter in email:
0834:             entity = doc.createEntityReference("#%d" % ord(letter))
0835:             el.appendChild(entity)
0836:         mailto = "mailto:" + email
0837:         mailto = "".join(['&#%d;' % ord(letter) for letter in mailto])
0838:         el.setAttribute('href', mailto)
0839:         return el

The first time I read this, I had no idea what it was doing. I mean, I could tell that it was replacing each character in a mailto: link with a character references, but I had no idea why. Then I came across this while looking through the Markdown syntax spec for something completely different:


Markdown supports a shortcut style for creating "automatic" links for
URLs and email addresses: simply surround the URL or email address
with angle brackets. What this means is that if you want to show the
actual text of a URL or email address, and also have it be a clickable
link, you can do this:

<http://example.com/>

Markdown will turn this into:

<a href="http://example.com/">http://example.com/</a>

Automatic links for email addresses work similarly, except that
Markdown will also perform a bit of randomized decimal and hex
entity-encoding to help obscure your address from address-harvesting
spambots. For example, Markdown will turn this:

<address@example.com>

into something like this:

<a href="&#x6D;&#x61;i&#x6C;&#x74;&#x6F;:&#x61;&#x64;&#x64;&#x72;&#x65;
&#115;&#115;&#64;&#101;&#120;&#x61;&#109;&#x70;&#x6C;e&#x2E;&#99;&#111;
&#109;">&#x61;&#x64;&#x64;&#x72;&#x65;&#115;&#115;&#64;&#101;&#120;&#x61;
&#109;&#x70;&#x6C;e&#x2E;&#99;&#111;&#109;</a>

which will render in a browser as a clickable link to "address@example.com".

All right: I doubt it's particularly effective, but it's another 20-odd lines I now understand.

4. The Element class contains this:

0225: class Element:
          ...
0238:     def setBidi(self, bidi):
0239:         if bidi:
0240:             orig_bidi = self.bidi
0241:             if not self.bidi or self.isDocumentElement:
0242:                 # Once the bidi is set don't change it (except for doc element)
0243:                 self.bidi = bidi
0244:                 self.parent.setBidi(bidi)

In other words, only the BIDI of the root document element can be changed, and when it is, it changes the BIDI of its parent (i.e., the document itself). This feels hacky, and I don't know what orig_bidi is for (a leftover from a debugging session?).

5. Element.toxml uses the expected recursive node-to-string-and-concatenate approach. It sometimes inserts an extra newline (presumably to make the generated HTML more readable), but there are a couple of things that are more puzzling. Number one is this, right at the top of the method:

0286:     def toxml(self):
0287:         if ENABLE_ATTRIBUTES:
0288:             for child in self.childNodes:
0289:                 child.handleAttributes()
          ...

I still don't understand what this is about; really need to ask the authors. The second is further down: after the children of this node have been processed, there is a block of code that reads:

0311:         buffer += "<" + self.nodeName
0312:         if self.nodeName in ['p', 'li', 'ul', 'ol',
0313:                              'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
0314:             if not self.attribute_values.has_key("dir"):
0315:                 if self.bidi:
0316:                     bidi = self.bidi
0317:                 else:
0318:                     bidi = self.doc.bidi
0319:                 if bidi=="rtl":
0320:                     self.setAttribute("dir", "rtl")
0321: 
0322:         for attr in self.attributes:
0323:             value = self.attribute_values[attr]
0324:             value = self.doc.normalizeEntities(value,
0325:                                                avoidDoubleNormalizing=True)
0326:             buffer += ' %s="%s"' % (attr, value)

I've been at this for about an hour now, and I can feel myself getting tired, so I'm going to draw a big circle around this, put a question mark beside it, and come back to it later. It's obviously something to do with bidirectionality---specifying the direction of a block of text, maybe---but then there's entity normalization and a call to normalizeEntities that sets avoidDoubleNormalizing to True to force it to use the end-limited regular expression for matching ampersands. Wait a second: the syntax for a character entity in HTML is &#dddd, where dddd represents some hexadecimal digits. Does this mean that the flag and extra table are there to prevent us escaping the leading ampersand of character entities we've already generated?

6. Then I get to the TextNode class:

0337: class TextNode:
0338: 
0339:     type = "text"
0340:     attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}
0341: 
0342:     def __init__ (self, text):
0343:         self.value = text
0344: 
0345:     def attributeCallback(self, match):
0346: 
0347:         self.parent.setAttribute(match.group(1), match.group(2))
0348: 
0349:     def handleAttributes(self):
0350:         self.value = self.attrRegExp.sub(self.attributeCallback, self.value)
0351: 
0352:     def toxml(self):
0353:         text = self.value
0354:         self.parent.setBidi(getBidiType(text))
0355:         if not text.startswith(HTML_PLACEHOLDER_PREFIX):
0356:             if self.parent.nodeName == "p":
0357:                 text = text.replace("\n", "\n   ")
0358:             elif (self.parent.nodeName == "li"
0359:                   and self.parent.childNodes[0]==self):
0360:                 text = "\n     " + text.replace("\n", "\n     ")
0361:         text = self.doc.normalizeEntities(text)
0362:         return text

There's a couple of goodies in this:

  1. The comment on line 340 drove me to Google for "python markdown attribute", which led me to this email message, which tells me that {@foo=bar} in a block of text in Markdown adds foo as an attribute to the enclosing node with "bar" as a value. Nice hack, and now that I know it's there, other things I've seen make sense.
  2. TextNode.toxml sets the BIDI of the parent node to whatever's appropriate for this text, but as I've already discovered, that's ignored unless we're at the root. Hm...

Preprocessors

Reading the Nanodom code gave me a better idea of how the parser as a whole is going to work, in the same way that the tools someone hands you tells you if you're going to be doing carpentry or brain surgery. With it out of the way, I hit this:

0379: """
0380: ======================================================================
0381: ========================== PRE-PROCESSORS ============================
0382: ======================================================================
0383: 
0384: Preprocessors munge source text before we start doing anything too
0385: complicated.
0386: 
0387: Each preprocessor implements a "run" method that takes a pointer to a
0388: list of lines of the document, modifies it as necessary and returns
0389: either the same pointer or a pointer to a new list.  Preprocessors
0390: must extend markdown.Preprocessor.
0391: 
0392: """
0392:
0392:
0395: class Preprocessor:
0396:     pass

I'm not sure why preprocessors have to extend Preprocessor---Python uses duck typing, so anything with a run method that takes and returns a list of lines ought to be usable---but that's a minor point. What really matters is that this description of preprocessors neatly extends the mental model I'm building of the parser's operation: instead of text->DOM->HTML, it's now text->lines->DOM->HTML. Take a look at the first preprocessor:

0399: class HeaderPreprocessor (Preprocessor):
0400:     """
0401:     Replaces underlined headers with hashed headers to avoid
0402:     the nead for lookahead later.
0403:     """
0404: 
0405:     def run (self, lines):
0406: 
0407:         i = -1
0408:         while i+1 < len(lines):
0409:             i = i+1
0410:             if not lines[i].strip():
0411:                 continue
0412: 
0413:             if lines[i].startswith("#"):
0414:                 lines.insert(i+1, "\n")
0415: 
0416:             if (i+1 <= len(lines)
0417:                   and lines[i+1]
0418:                   and lines[i+1][0] in ['-', '=']):
0419: 
0420:                 underline = lines[i+1].strip()
0421: 
0422:                 if underline == "="*len(underline):
0423:                     lines[i] = "# " + lines[i].strip()
0424:                     lines[i+1] = ""
0425:                 elif underline == "-"*len(underline):
0426:                     lines[i] = "## " + lines[i].strip()
0427:                     lines[i+1] = ""
0428: 
0429:         return lines
0430: 
0431: HEADER_PREPROCESSOR = HeaderPreprocessor()

It's immediately obvious what this does---if you've read the Markdown documentation, and know that it allows h1 headers to be written as either:

# Title  or  Title
=====

and h2 headers as:

## Title ##  or  Title
-----

Knowing that, this code is pretty easy to follow: it uses one-line lookahead to find cases of underlined headers, and turns them into their hash-prefixed equivalents, so that later processing can treat lines as units. I'm not sure that starting the index at -1 instead of the more conventional 0 makes anything simpler, but never mind---somewhere further down in the file, the singleton instance of this class that's assigned to HEADER_PREPROCESSOR is going to take the list of lines and (possibly) shrink it down a little.

The next preprocessor is a little more complicated. Here's the class header and run method:

0433: class LinePreprocessor (Preprocessor):
0434:     """Deals with HR lines (needs to be done before processing lists)"""
0435: 
0436:     blockquote_re = re.compile(r'^(> )+')
0437: 
0438:     def run (self, lines):
0439:         for i in range(len(lines)):
0440:             prefix = ''
0441:             m = self.blockquote_re.search(lines[i])
0442:             if m : prefix = m.group(0)
0443:             if self._isLine(lines[i][len(prefix):]):
0444:                 lines[i] = prefix + self.stash.store("<hr />", safe=True)
0445:         return lines

The docstring tells me that its job is to handle horizontal rules---the common hr tag. If that's the case, though, why does it start by looking for block quotes, which are written in Markdown like this:

> This bunch of lines will be
> indented as a block quote.

I think what's happening is that it's expanding horizontal rulers in block quotes as well: each one is stored in self.stash, presumably to be put back later. I don't understand what ":w" means in <:wHR> on line 448---a typo?---but it's not executable, so I won't worry about it now.

There is something more important thing in this code that I don't understand. That variable self.stash---where did it come from? It's not created in LinePreprocessor.__init__, because there is no such method. And it isn't in Preprocessor.__init__, because it doesn't exist either. What's going on?

It turns out that I have to look ahead almost eight hundred lines to the main class in the program to find it:

1099: class Markdown:
1100:     """
1101:     Markdown formatter class for creating an html document from Markdown text.
1102:     """
          ...
1203:     def reset(self):
1204:         """
1205:         Resets all state variables so that we can start with a new text.
1206:         """
1207:         self.references={}
1208:         self.htmlStash = HtmlStash()
1209: 
1210:         HTML_BLOCK_PREPROCESSOR.stash = self.htmlStash
1211:         LINE_PREPROCESSOR.stash = self.htmlStash
1212:         REFERENCE_PREPROCESSOR.references = self.references
1213:         HTML_PATTERN.stash = self.htmlStash
1214:         ENTITY_PATTERN.stash = self.htmlStash
1215:         REFERENCE_PATTERN.references = self.references
1216:         IMAGE_REFERENCE_PATTERN.references = self.references
1217:         RAWHTMLTEXTPOSTPROCESSOR.stash = self.htmlStash
1218:         RAWHTMLTEXTPOSTPROCESSOR.safeMode = self.safeMode
1219: 
1220:         for extension in self.registeredExtensions:
1221:             extension.reset()

That's legal, but it's not pretty. Python doesn't distinguish between "public" and "private" class members like C++ and Java do (but see the discussion below). Any piece of code can add, change, or delete any member of any object simply by assigning to it. It is almost universally considered bad style, though, since it means that people reading a class's definition can't know what instances of that class actually consist of at runtime. It also makes the code very fragile: if I tried to re-use LinePreprocessor in some other application, it would almost certainly fail.

I think it would be cleaner to add a reset method to Preprocessor that does nothing:

class Preprocessor:
    def reset(self, **kwargs):
        pass

and then have each derived class override it as appropriate:

class LinePreprocessor(Preprocessor):
    def reset(self, **kwargs):
        self.stash = kwargs['stash']

Lines 1210--1218 would then become:

for preproc in [HTML_BLOCK_PREPROCESSOR,
                LINE_PREPROCESSOR,
                HTML_PATTERN,
                ENTITY_PATTERN]:
    preproc.reset(stash=self.htmlStash)

for preproc in [REFERENCE_PREPROCESSOR,
                REFERENCE_PATTERN,
                IMAGE_REFERENCE_PATTERN]:
    preproc.reset(references=self.references)

RAWHTMLTEXTPOSTPROCESSOR.reset(stash=self.htmlStash, safeMode=self.safeMode)

This is a lot safer for future programmers to extend: just figure out what kind of preprocessor you have, and stick the singleton in the appropriate list. It would mean moving the definition of the HtmlStash class higher in the code, but that doesn't look like a problem, since it doesn't depend on anything else:

0941: class HtmlStash:
0942:     """
0943:     This class is used for stashing HTML objects that we extract
0944:     in the beginning and replace with place-holders.
0945:     """
0946: 
0947:     def __init__ (self):
0948:         self.html_counter = 0 # for counting inline html segments
0949:         self.rawHtmlBlocks=[]
0950: 
0951:     def store(self, html, safe=False):
0952:         """
0953:         Saves an HTML segment for later reinsertion.  Returns a
0954:         placeholder string that needs to be inserted into the
0955:         document.
0956:         @param html: an html segment
0957:         @param safe: label an html segment as safe for safemode
0958:         @param inline: label a segmant as inline html
0959:         @returns : a placeholder string
0960:         """
0961:         self.rawHtmlBlocks.append((html, safe))
0962:         placeholder = HTML_PLACEHOLDER % self.html_counter
0963:         self.html_counter += 1
0964:         return placeholder

Going back to LinePreprocessor, its other member is a utility method called _isLine. The leading underscore is as close as Python gets to information hiding; see the docs for more information. But look at this method's body:

0447:     def _isLine(self, block):
0448:         """Determines if a block should be replaced with an <:wHR>"""
0449:         if block.startswith("    "): return 0  # a code block
0450:         text = "".join([x for x in block if not x.isspace()])
0451:         if len(text) <= 2:
0452:             return 0
0453:         for pattern in ['isline1', 'isline2', 'isline3']:
0454:             m = RE.regExp[pattern].match(text)
0455:             if (m and m.group(1)):
0456:                 return 1
0457:         else:
0458:             return 0

The indentation for code blocks is hardwired on line 449; I'm pretty sure that will show up somewhere else in the code, so it ought to be lifted up to the constants section at the top of the file. I don't know what the <=2 test on lines 451--452 is for: ignoring single and double dashes, maybe, so that only three or more counts as a ruler? And what is the RE variable on line 454? A search forward finds this 500-odd lines later:

1096: RE = CorePatterns()

and a search backward from there finds this:

1067: class CorePatterns:
1068:     """
1069:     This class is scheduled for removal as part of a refactoring
1070:     effort.
1071:     """
1072: 
1073:     patterns = {
1074:         'header':          r'(#*)([^#]*)(#*)', # # A title
1075:         'reference-def':   r'(\ ?\ ?\ ?)\[([^\]]*)\]:\s*([^ ]*)(.*)',
1076:                            # [Google]: http://www.google.com/
1077:         'containsline':    r'([-]*)$|^([=]*)', # -----, =====, etc.
1078:         'ol':              r'[ ]{0,3}[\d]*\.\s+(.*)', # 1. text
1079:         'ul':              r'[ ]{0,3}[*+-]\s+(.*)', # "* text"
1080:         'isline1':         r'(\**)', # ***
1081:         'isline2':         r'(\-*)', # ---
1082:         'isline3':         r'(\_*)', # ___
1083:         'tabbed':          r'((\t)|(    ))(.*)', # an indented line
1084:         'quoted':          r'> ?(.*)', # a quoted block ("> ...")
1085:     }
1086: 
1087:     def __init__ (self):
1088: 
1089:         self.regExp = {}
1090:         for key in self.patterns.keys():
1091:             self.regExp[key] = re.compile("^%s$" % self.patterns[key],
1092:                                           re.DOTALL)
1093: 
1094:         self.regExp['containsline'] = re.compile(r'^([-]*)$|^([=]*)$', re.M)
1095: 
1096: RE = CorePatterns()

I don't understand why the 'containsline' pattern is defined (line 1077), and the 'quoted' pattern duplicates the one on line 436 in LinePreprocessor, but the docstring does say the class is scheduled for refactoring. Anyway, lines 1080--1082 confirm my earlier guess: lines less than two characters long are being weeded out because a horizontal rule has to be at least three characters long. But then why not rewrite the pattern so that it only matches three or more of the character? And why not fold the three patterns into one, so that _isLine doesn't need to loop? It doesn't matter right now: I understand this code well enough to move on.

The next preprocessor handles HTML blocks. It's the biggest of the bunch, and I'm feeling a little punchy, so I put a question mark beside it and skip ahead to the reference preprocessor, which is shorter and (hopefully) simpler. The code is:

0561: class ReferencePreprocessor (Preprocessor):
0562: 
0563:     def run (self, lines):
0564: 
0565:         new_text = [];
0566:         for line in lines:
0567:             m = RE.regExp['reference-def'].match(line)
0568:             if m:
0569:                 id = m.group(2).strip().lower()
0570:                 t = m.group(4).strip()  # potential title
0571:                 if not t:
0572:                     self.references[id] = (m.group(3), t)
0573:                 elif (len(t) >= 2
0574:                       and (t[0] == t[-1] == "\""
0575:                            or t[0] == t[-1] == "\'"
0576:                            or (t[0] == "(" and t[-1] == ")") ) ):
0577:                     self.references[id] = (m.group(3), t[1:-1])
0578:                 else:
0579:                     new_text.append(line)
0580:             else:
0581:                 new_text.append(line)
0582: 
0583:         return new_text #+ "\n"
0584: 
0585: REFERENCE_PREPROCESSOR = ReferencePreprocessor()

There's no docstring, but line 567 does refer to a regular expression called 'reference-def' in the RE variable I found while figuring out the previous preprocessor. Just to refresh your memory, its definition is:

1075: 'reference-def': r'(\ ?\ ?\ ?)\[([^\]]*)\]:\s*([^ ]*)(.*)', 1076: # [Google]: http://www.google.com/

The regexp itself is too complicated for me to figure out right away (three leading spaces, all optional---what's that about?), but the example below seems clear enough: text in square brackets, followed by a colon, followed by optional space, followed by anything that isn't a space character. Searching for "reference" in the Markdown syntax spec turns up this:


Reference-style links use a second set of square brackets, inside
which you place a label of your choosing to identify the link:

This is [an example][id] reference-style link.

You can optionally use a space to separate the sets of brackets:

This is [an example] [id] reference-style link.

Then, anywhere in the document, you define your link label like this,
on a line by itself:

[id]: http://example.com/  "Optional Title Here"

It looks like Markdown-1 is being a little more general than this---lines 574--576 seem to allow the title to be in double quotes, single quotes, or parentheses---but I feel safe adding this class to my growing model of the software. The only question I scribble down is, "Why does this create a new list, instead of removing lines from the existing list?"

Before diving into the last of the preprocessors, it's worth reviewing how I've been reading this code. I had a rough idea of how it worked before I started, thanks to comments at the head of the file and my own understanding of how text filters are usually written. I've been filling in that model by reading blocks of code, chasing down references and going back to the Markdown syntax documentation whenever I hit something I can't figure out. I've been making lots of marks on a hard copy of the code (printed landscape so that I have plenty of room to scribble). I can't really call them "notes", since most of them are just tick marks, question marks, or circles around bits of code, but the more lines I mark, the faster I read.

So, on to the HtmlBlockPreprocessor. It starts with four utility methods, none of which are documented or particularly readable:

0463: class HtmlBlockPreprocessor (Preprocessor):
0464:     """Removes html blocks from self.lines"""
0465: 
0466:     def _get_left_tag(self, block):
0467:         return block[1:].replace(">", " ", 1).split()[0].lower()
0468: 
0469: 
0470:     def _get_right_tag(self, left_tag, block):
0471:         return block.rstrip()[-len(left_tag)-2:-1].lower()
0472: 
0473:     def _equal_tags(self, left_tag, right_tag):
0474: 
0475:         if left_tag in ['?', '?php', 'div']: # handle PHP, etc.
0476:             return True
0477:         if ("/" + left_tag) == right_tag:
0478:             return True
0479:         if (right_tag == "--" and left_tag == "--"):
0480:             return True
0481:         elif left_tag == right_tag[1:] \
0482:             and right_tag[0] != "<":
0483:             return True
0484:         else:
0485:             return False
0486: 
0487:     def _is_oneliner(self, tag):
0488:         return (tag in ['hr', 'hr/'])

I'm going to trust that _get_left_tag and _get_right_tag do what their names suggest---I'm not happy with lengthy expressions like the one on line 467, but the program does run. _equal_tags is more problematic: do we really not care what the right tag is if the left tag is <?php> or <div>? And _is_oneliner is only checking for rulers, so presumably "one liner" means "one line across the rendered page", not "an expression that only takes up one line".

But onward: the run method starts like this:

0491:     def run (self, text):
0492: 
0493:         new_blocks = []
0494:         text = text.split("\n\n")

Whoa whoa whoa---what's with text.split? I thought preprocessors were given a list of lines as input? A search for HtmlBlockPreprocessor turns up exactly one use right after the class's definition:

0558: HTML_BLOCK_PREPROCESSOR = HtmlBlockPreprocessor()

That variable is used in just one place, in the main Markdown class:

1099: class Markdown:
          ...
1105:     def __init__(self, source=None,  # deprecated
1106:                  extensions=[],
1107:                  extension_configs=None,
1108:                  safe_mode = False):
          ...
1129:         self.textPreprocessors = [HTML_BLOCK_PREPROCESSOR]
1130: 
1131:         self.preprocessors = [HEADER_PREPROCESSOR,
1132:                               LINE_PREPROCESSOR,
1133:                               # A footnote preprocessor will
1134:                               # get inserted here
1135:                               REFERENCE_PREPROCESSOR]

Ah ha---the documentation was misleading. Despite being derived from Preprocessor, HtmlBlockPreprocessor actually works on the text as a whole, rather than on a list of lines. That makes the rest of run a little easier to---no, it's still hard to figure out. The text input is being turned into a list of lines by splitting on double newlines, which I presume means paragraph breaks (but if two paragraphs are separated by a line containing nothing but spaces and tabs, they'll be considered one paragraph). The result is being assigned back to text (line 494), which I think is bad style, since it changes the type of the variable, invalidating the unwary reader's assumptions about it in subsequent code.

Here's the rest of the method:

0496:         items = []
0497:         left_tag = ''
0498:         right_tag = ''
0499:         in_tag = False # flag
0500: 
0501:         for block in text:
0502:             if block.startswith("\n"):
0503:                 block = block[1:]
0504: 
0505:             if not in_tag:
0506: 
0507:                 if block.startswith("<"):
0508: 
0509:                     left_tag = self._get_left_tag(block)
0510:                     right_tag = self._get_right_tag(left_tag, block)
0511: 
0512:                     if not (is_block_level(left_tag) \
0513:                         or block[1] in ["!", "?", "@", "%"]):
0514:                         new_blocks.append(block)
0515:                         continue
0516: 
0517:                     if self._is_oneliner(left_tag):
0518:                         new_blocks.append(block.strip())
0519:                         continue
0520: 
0521:                     if block[1] == "!":
0522:                         # is a comment block
0523:                         left_tag = "--"
0524:                         right_tag = self._get_right_tag(left_tag, block)
0525:                         # keep checking conditions below and maybe just append
0526: 
0527:                     if block.rstrip().endswith(">") \
0528:                         and self._equal_tags(left_tag, right_tag):
0529:                         new_blocks.append(
0530:                             self.stash.store(block.strip()))
0531:                         continue
0532:                     else: #if not block[1] == "!":
0533:                         # if is block level tag and is not complete
0534:                         items.append(block.strip())
0535:                         in_tag = True
0536:                         continue
0537: 
0538:                 new_blocks.append(block)
0539: 
0540:             else:
0541:                 items.append(block.strip())
0542: 
0543:                 right_tag = self._get_right_tag(left_tag, block)
0544: 
0545:                 if self._equal_tags(left_tag, right_tag):
0546:                     # if find closing tag
0547:                     in_tag = False
0548:                     new_blocks.append(
0549:                         self.stash.store('\n\n'.join(items)))
0550:                     items = []
0551: 
0552:         if items:
0553:             new_blocks.append(self.stash.store('\n\n'.join(items)))
0554:             new_blocks.append('\n')
0555: 
0556:         return "\n\n".join(new_blocks)

Quick, what does the if on line 505 line up with? The answer is the else on line 540, but 35 lines is a long way to look for a match---I'd probably put the bodies of the if and else in helper methods. I'd probably also switch the order around: rather than "if not in tag...else...", I'd use "if in tag...else...", since non-negated logic is easier for people to read. I'd probably also strip all leading newlines, rather than just one (lines 502--503), and---is the call to _get_right_tag on line 510 guaranteed to find a matching right tag, or just the next one? Either way, why does the method re-set right_tag on line 524 if the opening tag is "<!" (i.e., a comment)? Why use a list on line 513? Why not check if block[1] is in a string (which would be faster)? How does the use of "?" here interact with the test on line 475 of _equal_tags that I saw earlier?

None of these questions are really important right now: as I've said before, the code passes its tests, and I'm not planning to rewrite it. But the temptation to refactor is often strong when you're going through someone else's code. No two programmers do things exactly the same way, and the more you understand the stuff you're reading, the stronger the temptation to "fix" it.

Resist. Life is short, code is long, and if you start down that path, you'll eventually find yourself writing your own programming language because you're not perfectly happy with anyone else's. I want a wiki processor for DrProject that I don't have to maintain, and whose syntax some users are likely to already know. I don't understand this code completely, but I now know enough to move on.

Postprocessors

The next big chunk of code is titled "Inline Patterns", but I'm feeling tired again, so I skip ahead looking for something easier. The very next block is:

0871: """
0872: ======================================================================
0873: ========================== POST-PROCESSORS ===========================
0874: ======================================================================
0875: 
0876: Markdown also allows post-processors, which are similar to
0877: preprocessors in that they need to implement a "run" method.  Unlike
0878: pre-processors, they take a NanoDom document as a parameter and work
0879: with that.
0880: 
0881: Post-Processor should extend markdown.Postprocessor.
0882: 
0883: There are currently no standard post-processors, but the footnote
0884: extension below uses one.
0885: """
0886: 
0887: class Postprocessor:
0888:     pass

It doesn't get much simpler than that. Since it only took a few seconds to read the blurb, I look ahead to the next block:

0891: """
0892: ======================================================================
0893: ======================== TEXT-POST-PROCESSORS ========================
0894: ======================================================================
0895: 
0896: Markdown also allows text-post-processors, which are similar to
0897: textpreprocessors in that they need to implement a "run" method.
0898: Unlike post-processors, they take a text string as a parameter and
0899: should return a string.
0900: 
0901: Text-Post-Processors should extend markdown.Postprocessor.
0902: """

Oops: this smells like a violation of the Liskov Substitution Principle. A postprocessor's run method is supposed to take a Nanodom document as input; here, though, we have a derived class whose override of that method takes a different kind of object as input. It would be cleaner and safer to define two kinds of postprocessors. But that aside, the only text postprocessor in the module is pretty easy to understand (even without a docstring):

0905: class RawHtmlTextPostprocessor(Postprocessor):
0906: 
0907:     def __init__(self):
0908:         pass
0909: 
0910:     def run(self, text):
0911:         for i in range(self.stash.html_counter):
0912:             html, safe  = self.stash.rawHtmlBlocks[i]
0913:             if self.safeMode and not safe:
0914:                 if str(self.safeMode).lower() == 'escape':
0915:                     html = self.escape(html)
0916:                 elif str(self.safeMode).lower() == 'remove':
0917:                     html = ''
0918:                 else:
0919:                     html = HTML_REMOVED_TEXT
0920: 
0921:             text = text.replace("<p>%s\n</p>" % (HTML_PLACEHOLDER % i),
0922:                               html + "\n")
0923:             text =  text.replace(HTML_PLACEHOLDER % i, html)
0924:         return text
0925: 
0926:     def escape(self, html):
0927:         ''' Basic html escaping '''
0928:         html = html.replace('&', '&amp;')
0929:         html = html.replace('<', '&lt;')
0930:         html = html.replace('>', '&gt;')
0931:         return html.replace('"', '&quot;')
0932: 
0933: RAWHTMLTEXTPOSTPROCESSOR = RawHtmlTextPostprocessor()

I read through this class earlier; it's just replacing HTML placeholders with the text that was originally in the document. I'm a little worried about that escape method, though; is this really the only place where HTML gets escaped? A quick search doesn't turn up any other suspicious calls to replace or escape (I search by looking for the strings "replace(" and "escape(", since the words themselves appear many other times in the code), so I make a note and take a break.

Inline Patterns

I've spent a couple of hours reading this code so far, and have covered about a thousand lines. This is in keeping with the estimates and advice in the Smart Bear book, but I'm going to have to pick up the pace a bit if I want to finish this afternoon. The next chunk in the file is inline patterns; as the header comment explains:

0587: """
0588: ======================================================================
0589: ========================== INLINE PATTERNS ===========================
0590: ======================================================================
0591: 
0592: Inline patterns such as *emphasis* are handled by means of auxiliary
0593: objects, one per pattern.  Pattern objects must be instances of classes
0594: that extend markdown.Pattern.  Each pattern object uses a single regular
0595: expression and needs support the following methods:
0596: 
0597:   pattern.getCompiledRegExp() - returns a regular expression
0598: 
0599:   pattern.handleMatch(m, doc) - takes a match object and returns
0600:                                 a NanoDom node (as a part of the provided
0601:                                 doc) or None
0602: 
0603: All of python markdown's built-in patterns subclass from Patter,
0604: but you can add additional patterns that don't.
0605: 
0606: Also note that all the regular expressions used by inline must
0607: capture the whole block.  For this reason, they all start with
0608: '^(.*)' and end with '(.*)!'.  In case with built-in expression
0609: Pattern takes care of adding the "^(.*)" and "(.*)!".
0610: 
0611: Finally, the order in which regular expressions are applied is very
0612: important - e.g. if we first replace http://.../ links with <a> tags
0613: and _then_ try to replace inline html, we would end up with a mess.
0614: So, we apply the expressions in the following order:
0615: 
0616:        * escape and backticks have to go before everything else, so
0617:          that we can preempt any markdown patterns by escaping them.
0618: 
0619:        * then we handle auto-links (must be done before inline html)
0620: 
0621:        * then we handle inline HTML.  At this point we will simply
0622:          replace all inline HTML strings with a placeholder and add
0623:          the actual HTML to a hash.
0624: 
0625:        * then inline images (must be done before links)
0626: 
0627:        * then bracketed links, first regular then reference-style
0628: 
0629:        * finally we apply strong and emphasis
0630: """

That seems simple enough. The next block is just a bunch of regular expressions with more-or-less comprehensible names and examples in comments---the examples are particularly welcome:

0632: NOBRACKET = r'[^\]\[]*'
0633: BRK = ( r'\[('
0634:         + (NOBRACKET + r'(\[')*6
0635:         + (NOBRACKET+ r'\])*')*6
0636:         + NOBRACKET + r')\]' )
0637: NOIMG = r'(?<!\!)'
0638: 
0639: BACKTICK_RE = r'\`([^\`]*)\`'                    # `e= m*c^2`
0640: DOUBLE_BACKTICK_RE =  r'\`\`(.*)\`\`'            # ``e=f("`")``
0641: ESCAPE_RE = r'\\(.)'                             # \<
0642: EMPHASIS_RE = r'\*([^\*]*)\*'                    # *emphasis*
0643: STRONG_RE = r'\*\*(.*)\*\*'                      # **strong**
0644: STRONG_EM_RE = r'\*\*\*([^_]*)\*\*\*'            # ***strong***
0645: 
0646: if SMART_EMPHASIS:
0647:     EMPHASIS_2_RE = r'(?<!\S)_(\S[^_]*)_'        # _emphasis_
0648: else:
0649:     EMPHASIS_2_RE = r'_([^_]*)_'                 # _emphasis_
0650: 
0651: STRONG_2_RE = r'__([^_]*)__'                     # __strong__
0652: STRONG_EM_2_RE = r'___([^_]*)___'                # ___strong___
0653: 
0654: LINK_RE = NOIMG + BRK + r'\s*\(([^\)]*)\)'               # [text](url)
0655: LINK_ANGLED_RE = NOIMG + BRK + r'\s*\(<([^\)]*)>\)'      # [text](<url>)
0656: IMAGE_LINK_RE = r'\!' + BRK + r'\s*\(([^\)]*)\)' # ![alttxt](http://x.com/)
0657: REFERENCE_RE = NOIMG + BRK+ r'\s*\[([^\]]*)\]'           # [Google][3]
0658: IMAGE_REFERENCE_RE = r'\!' + BRK + '\s*\[([^\]]*)\]' # ![alt text][2]
0659: NOT_STRONG_RE = r'( \* )'                        # stand-alone * or _
0660: AUTOLINK_RE = r'<(http://[^>]*)>'                # <http://www.123.com>
0661: AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'               # <me@example.com>
0662: #HTML_RE = r'(\<[^\>]*\>)'                        # <...>
0663: HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)'               # <...>
0664: ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)'                # &amp;
0665: LINE_BREAK_RE = r'  \n'                     # two spaces at end of line
0666: LINE_BREAK_2_RE = r'  $'                    # two spaces at end of text

Next up is a class called Pattern compiles and stores a regular expression:

0669: class Pattern:
0670: 
0671:     def __init__ (self, pattern):
0672:         self.pattern = pattern
0673:         self.compiled_re = re.compile("^(.*)%s(.*)$" % pattern, re.DOTALL)
0674: 
0675:     def getCompiledRegExp (self):
0676:         return self.compiled_re
0677:
0678: BasePattern = Pattern # for backward compatibility

I'm not sure why the author bothered; a compiled regexp is a perfectly good object, and its pattern member stores the text used to create it, so maybe the point is to wrap the regexp in (.*) (which matches zero or more of anything)? Reading on, there are three derived class:

0681: class SimpleTextPattern (Pattern):
0682: 
0683:     def handleMatch(self, m, doc):
0684:         return doc.createTextNode(m.group(2))
0685:
0686:
0687: class SimpleTagPattern (Pattern):
0688: 
0689:     def __init__ (self, pattern, tag):
0690:         Pattern.__init__(self, pattern)
0691:         self.tag = tag
0692: 
0693:     def handleMatch(self, m, doc):
0694:         el = doc.createElement(self.tag)
0695:         el.appendChild(doc.createTextNode(m.group(2)))
0696:         return el
0697:
0698:
0699: class SubstituteTagPattern (SimpleTagPattern):
0700: 
0701:     def handleMatch (self, m, doc):
0702:         return doc.createElement(self.tag)

Docstrings would have been nice; as it is, I have to guess why there are three, and how each will be used. (As an aside, wouldn't it be great if we could put class diagrams in source code, so that we could see the relationships between classes like this at a glance? I've written about this before; I'm still waiting...)

Onward to a class called BacktickPattern:

0705: class BacktickPattern (Pattern):
0706: 
0707:     def __init__ (self, pattern):
0708:         Pattern.__init__(self, pattern)
0709:         self.tag = "code"
0710: 
0711:     def handleMatch(self, m, doc):
0712:         el = doc.createElement(self.tag)
0713:         text = m.group(2).strip()
0714:         #text = text.replace("&", "&amp;")
0715:         el.appendChild(doc.createTextNode(text))
0716:         return el

The fact that line 714 is commented out worries me, but everything else seems pretty clear. It would be easier to understand if the regexp was right here, so that I could read it while I'm reading the class that uses it. Instead, I have to search forward to find out that this class is actually used to wrap two patterns:

0845: BACKTICK_PATTERN        = BacktickPattern(BACKTICK_RE)
0846: DOUBLE_BACKTICK_PATTERN = BacktickPattern(DOUBLE_BACKTICK_RE)

I then search backward to find their definitions (which I skimmed over, but didn't remember, less than two minutes ago):

0639: BACKTICK_RE = r'\`([^\`]*)\`'                    # `e= m*c^2`
0640: DOUBLE_BACKTICK_RE =  r'\`\`(.*)\`\`'            # ``e=f("`")``

and then forward again to find that the pattern objects are used to build a lookup table in the constructor of the main Markdown class:

1099: class Markdown:
          ...
1105:     def __init__(self, source=None,  # deprecated
1106:                  extensions=[],
1107:                  extension_configs=None,
1108:                  safe_mode = False):
          ...
1146:         self.inlinePatterns = [DOUBLE_BACKTICK_PATTERN,
1147:                                BACKTICK_PATTERN,

A quick surf through the Markdown syntax documentation tells me that single backticks are used to format inline code blocks, while double backticks are used when the code block contains single backticks.

I think this code is ripe for refactoring. As a first step, it, and many of its siblings, could be put into a table like this:

stuff = (
    (r'\`([^\`]*)\`', BacktickPattern),
    (r'\`\`(.*)\`\`', BacktickPattern),
    ...
)

The Markdown class's inlinePatterns member could then be built as:

    self.inlinePatterns = []
    for (pat, cls) in stuff:
        self.inlinePatterns.append(cls(pat))

There are plenty of other ways to do it as well: for example, store the patterns to be used with each matching class in a class variable, so that they're right there for reading, then have each class self-evaluate to create its own instances. These aren't just tricks; instead, they take advantage of the fact that Python is a dynamic language, and that it's easy to store things like class objects in data structures.

But I will resist the temptation and press on to the DoubleTagPattern class:

0719: class DoubleTagPattern (SimpleTagPattern):
0720: 
0721:     def handleMatch(self, m, doc):
0722:         tag1, tag2 = self.tag.split(",")
0723:         el1 = doc.createElement(tag1)
0724:         el2 = doc.createElement(tag2)
0725:         el1.appendChild(el2)
0726:         el2.appendChild(doc.createTextNode(m.group(2)))
0727:         return el1

Its use:

0852: STRONG_EM_PATTERN       = DoubleTagPattern(STRONG_EM_RE, 'strong,em')
0853: STRONG_EM_PATTERN_2     = DoubleTagPattern(STRONG_EM_2_RE, 'strong,em')

makes its purpose clear: Markdown represents strong emphasized text using a single marker, so this class catches that case and wraps the text in two tags rather than one. How about HtmlPattern?

0730: class HtmlPattern (Pattern):
0731: 
0732:     def handleMatch (self, m, doc):
0733:         rawhtml = m.group(2)
0734:         inline = True
0735:         place_holder = self.stash.store(rawhtml)
0736:         return doc.createTextNode(place_holder)

The pattern used to instantiate its unique instance is:

0663: HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)'               # <...>

and again, it's the comment that saves me---this pattern is used to match literal HTML embedded in text.

There are five more of these classes: LinkPattern, ImagePattern, ReferencePattern (for matching references to pre-defined URLs), ImageReferencePattern (ditto, but for images), and AutolinkPattern (which matches literal URLs embedded in angle brackets). Each is instantiated one or more times to create matchers for each of the unique things Markdown supports. Now that I have the general pattern in my head, I don't bother to read them in detail---their names and the comments on the regexp's used to instantiate them are enough. Time for another quick break.

A Few More Utilities

There---one more cup of tea ought to be enough to finish this off. There's a block of "miscellaneous auxiliary classes" next that starts with a "block guru" to find the boundaries of indented blocks:

0935: """
0936: ======================================================================
0937: ========================== MISC AUXILIARY CLASSES ====================
0938: ======================================================================
0939: """
0940: 
0965: 
0966: 
0967: class BlockGuru:
0968: 
0969:     def _findHead(self, lines, fn, allowBlank=0):
0970:         """
0971:         Functional magic to help determine boundaries of indented
0972:         blocks.
0973: 
0974:         @param lines: an array of strings
0975:         @param fn: a function that returns a substring of a string
0976:                    if the string matches the necessary criteria
0977:         @param allowBlank: specifies whether it's ok to have blank
0978:                            lines between matching functions
0979:         @returns: a list of post processes items and the unused
0980:                   remainder of the original list"""
          ...
1027: 
1028: 
1029:     def detabbed_fn(self, line):
1030:         """ An auxiliary method to be passed to _findHead """
1031:         m = RE.regExp['tabbed'].match(line)
1032:         if m:
1033:             return m.group(4)
1034:         else:
1035:             return None
1036: 
1037: 
1038:     def detectTabbed(self, lines):
1039: 
1040:         return self._findHead(lines, self.detabbed_fn,
1041:                               allowBlank = 1)

I've omitted the body of _findHead because 42 lines of loop-wrapped conditionals isn't something I want to read right now. Having written line-oriented parsers myself, I have a pretty good idea of how this one is going to work; it (almost certainly) works, and I (almost certainly) will be able to understand it if I need to. The last two utility functions are a breeze by comparison:

1044: def print_error(string):
1045:     """Print an error string to stderr"""
1046:     sys.stderr.write(string +'\n')
1047: 
1048: 
1049: def dequote(string):
1050:     """ Removes quotes from around a string """
1051:     if ( ( string.startswith('"') and string.endswith('"'))
1052:          or (string.startswith("'") and string.endswith("'")) ):
1053:         return string[1:-1]
1054:     else:
1055:         return string

Core Markdown

Finally, Markdown itself, the heart of the whole parser. I could have started with it, then worked outward through the utilities and supporting classes to get a top-down view of how this all hangs together. I could equally well have fired the parser up in a debugger and stepped through it, skipping over constructor and method calls the first time, then diving deeper and deeper on subsequent passes until I had an execution-oriented view. They're all good strategies, and play well together; no matter which one you use, you don't want to hit a comment like this:

1057: """
1058: ======================================================================
1059: ========================== CORE MARKDOWN =============================
1060: ======================================================================
1061: 
1062: This stuff is ugly, so if you are thinking of extending the syntax,
1063: see first if you can do it via pre-processors, post-processors,
1064: inline patterns or a combination of the three.
1065: """

Gulp. Markdown is a little over 700 lines long. That's over a third of the total length of the file, all in one class. Let's start with method headers:

1105:     def __init__(self, source=None, ...):
1175:     def registerExtensions(self, extensions, configs):
1199:     def registerExtension(self, extension):
1203:     def reset(self):
1223:     def _transform(self):
1273:     def _processSection(self, parent_elem, lines,
1348:     def _processHeader(self, parent_elem, paragraph):
1360:     def _processParagraph(self, parent_elem, paragraph, inList, looseList):
1379:     def _processUList(self, parent_elem, lines, inList):
1383:     def _processOList(self, parent_elem, lines, inList):
1388:     def _processList(self, parent_elem, lines, inList, listexpr, tag):
1475:     def _linesUntil(self, lines, condition):
1490:     def _processQuote(self, parent_elem, lines, inList):
1528:     def _processCodeBlock(self, parent_elem, lines, inList):
1553:     def _handleInlineWrapper (self, line, patternIndex=0):
1579:     def _handleInline(self,  line):
1601:     def _applyPattern(self, line, pattern, patternIndex):
1663:     def convert (self, source = None):
1697:     def __str__(self):

Right: there's a constructor and four "public methods". The constructor sticks all the pattern-matching classes defined earlier into a table, and sets up other configuration values (e.g., sets self.registeredExtensions to an empty list and self.stripTopLevelTags to 1). Two of the public methods are for registering extensions---nothing to worry about there---and then there's convert, which is the main driver for converting a blob of Markdown to HTML:

1663:     def convert (self, source = None):
1664:         """
1665:         Return the document in XHTML format.
1666:         @returns: A serialized XHTML body.
1667:         """
1668: 
1669:         if source is not None: # Allow blank string
1670:             self.source = source
1671: 
1672:         if not self.source:
1673:             return u""
1674: 
1675:         try:
1676:             self.source = unicode(self.source)
1677:         except UnicodeDecodeError:
1678:             message(CRITICAL, 'UnicodeDecodeError: Markdown only accepts unicode or ascii input.')
1679:             return u""
1680: 
1681:         for pp in self.textPreprocessors:
1682:             self.source = pp.run(self.source)
1683: 
1684:         doc = self._transform()
1685:         xml = doc.toxml()
1686: 
1687:         # Return everything but the top level tag
1688:         if self.stripTopLevelTags:
1689:             xml = xml.strip()[23:-7] + "\n"
1690: 
1691:         for pp in self.textPostprocessors:
1692:             xml = pp.run(xml)
1693: 
1694:         return (self.docType + xml).strip()

Reading through convert gives me a pretty good idea of how everything else fits together: the preprocessors are run, _transform is called to turn the result into a tree, that's turned back into text, um, there's some magic where the first 23 and last 7 characters are stripped off (what's that about?), then the postprocessors are run and the result is returned. _transform seems the next logical place to go:

1223:     def _transform(self):
1224:         """
1225:         Transforms the Markdown text into a XHTML body document
1226:         @returns: A NanoDom Document
1227:         """
1228: 
1229:         # Setup the document
1230: 
1231:         self.doc = Document()
1232:         self.top_element = self.doc.createElement("span")
1233:         self.top_element.appendChild(self.doc.createTextNode('\n'))
1234:         self.top_element.setAttribute('class', 'markdown')
1235:         self.doc.appendChild(self.top_element)
1236: 
1237:         # Fixup the source text
1238:         text = self.source
1239:         text = text.replace("\r\n", "\n").replace("\r", "\n")
1240:         text += "\n\n"
1241:         text = text.expandtabs(TAB_LENGTH)
1242: 
1243:         # Split into lines and run the preprocessors that will work with
1244:         # self.lines
1245:         self.lines = text.split("\n")
1246: 
1247:         # Run the pre-processors on the lines
1248:         for prep in self.preprocessors :
1249:             self.lines = prep.run(self.lines)
1250: 
1251:         # Create a NanoDom tree from the lines and attach it to Document
1252:         buffer = []
1253:         for line in self.lines:
1254:             if line.startswith("#"):
1255:                 self._processSection(self.top_element, buffer)
1256:                 buffer = [line]
1257:             else:
1258:                 buffer.append(line)
1259:         self._processSection(self.top_element, buffer)
1260: 
1261:         #self._processSection(self.top_element, self.lines)
1262: 
1263:         # Not sure why I put this in but let's leave it for now.
1264:         self.top_element.appendChild(self.doc.createTextNode('\n'))
1265: 
1266:         # Run the post-processors
1267:         for postprocessor in self.postprocessors:
1268:             postprocessor.run(self.doc)
1269: 
1270:         return self.doc

Oops---my earlier comment was inaccurate. Markdown.convert only runs the text preprocessors before calling this method; the line-oriented preprocessors are called here, on lines 1248--1249. (The same is true of the postprocessors.) I don't understand why lines starting with '#' are special (line 1254), or why the commented-out call to _processSection on line 1261 wasn't just taken out (the call on line 1259 seems to have taken its place), and even the author doesn't know why an extra newline is being added on line 1264, but never mind: the intent is clear.

So, what does _processSection do? The answer is, "a lot". According to its docstring, its job is to look for high-level structural elements like lists, block quotes, and so on, then strip them of their markup and process their content recursively. This method uses a table to drive calls to four others: _processUList, _processOList, _processQuote, and _processCodeBlock:

1273:     def _processSection(self, parent_elem, lines,
1274:                         inList = 0, looseList = 0):
1275:         """
1276:         Process a section of a source document, looking for high level
1277:         structural elements like lists, block quotes, code segments,
1278:         html blocks, etc.  Some those then get stripped of their high
1279:         level markup (e.g. get unindented) and the lower-level markup
1280:         is processed recursively.
1281: 
1282:         @param parent_elem: A NanoDom element to which the content
1283:                             will be added
1284:         @param lines: a list of lines
1285:         @param inList: a level
1286:         @returns: None
1287:         """
1288: 
1289:         # Loop through lines until none left.
1290:         while lines:
1291: 
1292:             # Check if this section starts with a list, a blockquote or
1293:             # a code block
1294: 
1295:             processFn = { 'ul':     self._processUList,
1296:                           'ol':     self._processOList,
1297:                           'quoted': self._processQuote,
1298:                           'tabbed': self._processCodeBlock}
1299: 
1300:             for regexp in ['ul', 'ol', 'quoted', 'tabbed']:
1301:                 m = RE.regExp[regexp].match(lines[0])
1302:                 if m:
1303:                     processFn[regexp](parent_elem, lines, inList)
1304:                     return
1305: 
1306:             # We are NOT looking at one of the high-level structures like
1307:             # lists or blockquotes.  So, it's just a regular paragraph
1308:             # (though perhaps nested inside a list or something else).  If
1309:             # we are NOT inside a list, we just need to look for a blank
1310:             # line to find the end of the block.  If we ARE inside a
1311:             # list, however, we need to consider that a sublist does not
1312:             # need to be separated by a blank line.  Rather, the following
1313:             # markup is legal:
1314:             #
1315:             # * The top level list item
1316:             #
1317:             #     Another paragraph of the list.  This is where we are now.
1318:             #     * Underneath we might have a sublist.
1319:             #
1320: 
1321:             if inList:
1322: 
1323:                 start, lines  = self._linesUntil(lines, (lambda line:
1324:                                  RE.regExp['ul'].match(line)
1325:                                  or RE.regExp['ol'].match(line)
1326:                                                   or not line.strip()))
1327: 
1328:                 self._processSection(parent_elem, start,
1329:                                      inList - 1, looseList = looseList)
1330:                 inList = inList-1
1331: 
1332:             else: # Ok, so it's just a simple block
1333: 
1334:                 paragraph, lines = self._linesUntil(lines, lambda line:
1335:                                                      not line.strip())
1336: 
1337:                 if len(paragraph) and paragraph[0].startswith('#'):
1338:                     self._processHeader(parent_elem, paragraph)
1339: 
1340:                 elif paragraph:
1341:                     self._processParagraph(parent_elem, paragraph,
1342:                                           inList, looseList)
1343: 
1344:             if lines and not lines[0].strip():
1345:                 lines = lines[1:]  # skip the first (blank) line

I would make some changes if I was refactoring:

  1. Use the keys from the processFn table to drive the loop on line 1300 rather than repeating them (and convert processFn to a list of pairs if the order is important).
  2. Simplify the while loop starting on line 1290: it's not immediately clear (a) that every pass through the loop is expected to cut some content out of lines, or (b) that the early return on line 1304 is safe.
  3. Reorganize the handling of the inList depth variable in one spot (OK, this is minor, but it's a habit I picked up the hard way).

There's another 320 lines of code in this class, but I'm going to skip over it: the "new understanding per line" ratio is dropping off pretty quickly. Some of this is fatigue, but mostly it's because I now understand the application well enough that almost everything I see just confirms things that are already in my mental model. That's not the same as saying that I already know them; rather, it means that nothing I see contradicts any of the ideas I've built up about the code's organization, structure, or operation. If I was planning to refactor this, I'd read the Markdown class in detail, since it would be the biggest candidate for reorganization, but I'm not, so I won't.

Everything Else

The home stretch is the driver functions near the bottom of the file that process files and their contents. markdownFromFile is:

1711: def markdownFromFile(input = None,
1712:                      output = None,
1713:                      extensions = [],
1714:                      encoding = None,
1715:                      message_threshold = CRITICAL,
1716:                      safe = False):
1717: 
1718:     global console_hndlr
1719:     console_hndlr.setLevel(message_threshold)
1720: 
1721:     message(DEBUG, "input file: %s" % input)
1722: 
1723:     if not encoding:
1724:         encoding = "utf-8"
1725: 
1726:     input_file = codecs.open(input, mode="r", encoding=encoding)
1727:     text = input_file.read()
1728:     input_file.close()
1729: 
1730:     text = removeBOM(text, encoding)
1731: 
1732:     new_text = markdown(text, extensions, safe_mode = safe)
1733: 
1734:     if output:
1735:         output_file = codecs.open(output, "w", encoding=encoding)
1736:         output_file.write(new_text)
1737:         output_file.close()
1738: 
1739:     else:
1740:         sys.stdout.write(new_text.encode(encoding))

No surprises there, and none in the markdown function either:

1742: def markdown(text,
1743:              extensions = [],
1744:              safe_mode = False):
1745: 
1746:     message(DEBUG, "in markdown.markdown(), received text:\n%s" % text)
1747: 
1748:     extension_names = []
1749:     extension_configs = {}
1750: 
1751:     for ext in extensions:
1752:         pos = ext.find("(")
1753:         if pos == -1:
1754:             extension_names.append(ext)
1755:         else:
1756:             name = ext[:pos]
1757:             extension_names.append(name)
1758:             pairs = [x.split("=") for x in ext[pos+1:-1].split(",")]
1759:             configs = [(x.strip(), y.strip()) for (x, y) in pairs]
1760:             extension_configs[name] = configs
1761: 
1762:     md = Markdown(extensions=extension_names,
1763:                   extension_configs=extension_configs,
1764:                   safe_mode = safe_mode)
1765: 
1766:     return md.convert(text)

Finally, there's a class called Extension, which presumably is supposed to be used as the base class for extensions (a docstring to this effect, and some usage notes, would be useful):

1769: class Extension:
1770: 
1771:     def __init__(self, configs = {}):
1772:         self.config = configs
1773: 
1774:     def getConfig(self, key):
1775:         if self.config.has_key(key):
1776:             return self.config[key][0]
1777:         else:
1778:             return ""
1779: 
1780:     def getConfigInfo(self):
1781:         return [(key, self.config[key][1]) for key in self.config.keys()]
1782: 
1783:     def setConfig(self, key, value):
1784:         self.config[key][0] = value

And that's it; after three hours and twenty minutes, I'm done.

Closing Remarks

So, was it worth it? I was already 99% sure that the code worked, and as I've said at least four times, I'm not planning to refactor it, so were 200 minutes of close reading a good investment? I think so: it'll probably be easier now for me to write the extensions I'll need to write to handle DrProject-specific links, and debugging anything that blows up will definitely be easier. More importantly, I'm now much more confident that the code will carry the weight we plan to put on it: it's organized the way I'd expect it to be, it uses idioms I'm familiar with, and generally shows plenty of signs of good artisanship.

Now all I have to do is integrate it...


Copyright (c) 2008 Greg Wilson