WO2009049275A1 - Method for detecting and resolving hidden text salting - Google Patents


Info

Publication number
WO2009049275A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
reading
page
reading order
candidate
Prior art date
Application number
PCT/US2008/079671
Other languages
French (fr)
Inventor
Marie-Francine Moens
Jan De Beer
Original Assignee
Symantec Corporation
Application filed by Symantec Corporation filed Critical Symantec Corporation
Publication of WO2009049275A1 publication Critical patent/WO2009049275A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/107Computer-aided management of electronic mailing [e-mailing]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents
    • G06F40/143Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21Monitoring or handling of messages
    • H04L51/212Monitoring or handling of messages using filtering or selective blocking

Definitions

  • the invention relates to the inspection, filtering, and processing of textual data in an adversary environment. Described is a technology by which hidden text salting (i.e., distorted or hidden textual content patterns) is detected and resolved in digital text sources by analyzing the reproduction (graphical rendering) of the text source and, subsequently, by cognitively processing the retained visible content to derive the perceived text.
  • the rendering analysis applies a set of general visibility conditions to the rendered, attributed text primitives (glyphs) to determine their visibility. Retaining only the visible glyphs, an artificially intelligent cognitive model of human text reading is used to compose the glyphs in logical text fragments (i.e., words in sentences in text blocks).
  • the cognitive model which we propose searches for an optimal partitioning of the two-dimensional glyph space, in which logical text blocks are identified and assigned the most likely text reading direction.
  • the invention enables the detection of general types of hidden salting tricks (e.g., as useful indicators for filtering) and makes available the actually perceived text contents, which can be used for further processing.
  • Salting is the intentional addition or distortion of content patterns in a digital data source for reasons of obfuscation or evasion of certain methods of automated analysis or inspection, in particular that of content filtering.
  • hidden salting e.g., text displayed with invisible ink
  • surface salting e.g., images containing random, anomalous pixel dots
  • the invention targets hidden salting of textual data.
  • Hidden salting is the most dangerous in the context of fraudulent schemes, since naive users are easily misled. Hidden salting is, for instance, found in phishing e-mails that aim at stealing personal information, which can be used to commit identity theft.
  • Salting in digital content is a phenomenon that only recently has drawn some scientific attention. However, given the increasing usage and importance of automated content filtering, e.g., on interconnected networks including the Internet, it can be expected that salting will become more widespread and more sophisticated. In the past few years, researchers have primarily focused on email spam, as it constitutes one of the best known digital sources of salting. We wish to note, however, that salting can be applied to any medium (e.g., e-mail, Web pages, electronic documents, MMS mobile messages, MP3 music files) and content type (e.g., text, picture, audio elements in these media). [0004] To the best of our knowledge, the remainder of this section lists the main technologies from the literature that are relevant or related to the invention, both historical and state-of-the-art.
  • The method of [4] recognizes embedded text in images using OCR tools.
  • the additionally extracted tokens reduce spam misclassification by up to around one half when using state-of-the-art, tokenization-based classifiers.
  • Fuzzy signature anti-spam methods [11, 5] apply hashing techniques to e-mail messages for the purpose of mass mailer detection.
  • the hashes, alternatively termed checksums, signatures, or digests, are computed at the different client sites and collected in a redundant network of secure servers. Any client may query these servers for counts on similar submitted messages (similar by matching checksum).
  • through these checksums, a collective memory of bulk (spam) e-mail is formed, adding collaborative filtering capability to local, on-site filtering.
  • Two popular distributed signature systems are DCC [16] and Razor [15].
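The distributed-checksum idea described above can be illustrated with a minimal sketch: normalise a message body, hash it, and count matching submissions. The normalisation rule and the in-memory `server` counter are illustrative stand-ins, not the actual (far more robust) fuzzy-hashing schemes used by DCC or Razor.

```python
import hashlib
import re
from collections import Counter

def fuzzy_digest(message: str) -> str:
    """A crude fuzzy signature: lowercase, strip digits and whitespace, then hash.
    Real systems (DCC, Razor) use much more robust fuzzy-hashing schemes."""
    normalised = re.sub(r"[\d\s]+", "", message.lower())
    return hashlib.sha256(normalised.encode()).hexdigest()

server = Counter()  # stands in for the redundant network of secure servers

def submit(message: str) -> int:
    """Submit a message's checksum; return the bulk count seen so far."""
    digest = fuzzy_digest(message)
    server[digest] += 1
    return server[digest]

submit("Buy now! Offer 123")
count = submit("BUY NOW!  Offer 456")  # same digest after normalisation
print(count)  # -> 2: both messages collapse to the same checksum
```

Because minor per-message variations (case, spacing, serial numbers) are normalised away, mass-mailed near-duplicates accumulate high counts, which local filters can then query.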
  • DIU Document Image Understanding
  • OCR Optical Character Recognition
  • the cognitive model of the invention differs from related DIU (OCR) methods in that it efficiently operates on correctly identified glyphs (i.e., intercepted from rendering commands rather than extracted by OCR from scanned images), and anticipates the use of unconventional (misleading) reading directions, which may still be correctly interpreted by humans' perceptive abilities (see Figure 2).
  • the cognitive model presented is related to segmentation and content recognition in computer vision. Image segmentation is a fundamental component in many computer vision applications and is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest [10].
  • the major difference of the invention is that it operates on glyphs resulting from the analysis of a rendering process, rather than the pixels of an image.
  • pieces of text are segmented that have the same property of interest, namely the reading direction, and that are segregated from each other based on layout properties, in particular visual boundaries.
  • common properties of interest are brightness, colour, texture, motion, and pixel location proximity.
  • the method of the present invention comprises the steps of: a) building an internal representation of the characters that appear on the output medium for the user to read, b) classifying each of the characters in said internal representation as visible or invisible by evaluating one or more visibility conditions on general visual properties of the character shapes (glyphs), c) indicating that the digital text source contains hidden content when one or more of the characters is classified as invisible.
  • the visibility conditions comprise the font size of a glyph, the shape size of a glyph, the contrast of a glyph's color with the background color.
  • the character can be marked as invisible when its glyph is substantially concealed by other shapes, including other glyphs or when its glyph is substantially clipped. Furthermore, the character can be marked as invisible when the display time of its glyph is too short to be observed or otherwise interpreted within context by humans (the display time determined e.g., by empirical studies).
  • the method further comprises a) constructing a page containing only characters classified as visible, b) determining the reading order in which the visible characters are most likely read by humans, c) comparing the reading order obtained in (b) with the rendering order of the characters, d) indicating that the digital text source contains distorted content when the reading order differs from the rendering order.
  • the reading order in which the visible characters are most likely read by humans is determined by a cognitive model.
  • the method further comprises determining the perceived content of the digital text source by ordering the visible characters according to the determined reading order.
  • the determining of the reading order comprises the steps of: a) creating a set of candidate reading orders, b) calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of one or more discriminative features, c) selecting the reading order from said set of candidate reading orders by selecting the reading order with the highest probability.
  • determining the reading order comprises the steps of: a) creating a set of candidate reading orders, b) creating a set of discriminative features, c) calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of a discriminative feature out of the set of discriminative features, d) comparing the computed probabilities of (c) to remove unlikely reading order candidates from the set of candidate reading orders, e) repeating steps (c) and (d) for the remaining candidate reading orders whereby another discriminative feature is considered, until all discriminative features out of the set of discriminative features are considered, f) selecting the reading order from the remaining set of candidate reading orders by taking all probabilities related to the different discriminative features into account.
  • the page of visible characters is first partitioned into different text blocks and the reading order is determined for each text block.
  • the digital text source is indicated as containing distorted content when one of the text blocks has a reading order which differs from the rendering order of the characters within that text block.
  • the partitioning of a page of visible characters into different text blocks comprises the steps of: 1. calculating an objective value of the page,
  • calculating an objective value of a page or partitioning of a page can be based on the confidence scores of its member text blocks.
  • creating a set of candidate partitionings is based on visual properties of the page, including but not limited to text layout and spacing.
  • the text blocks with associated reading order are composed into an overall perceived text.
  • the digital text source can be any text-oriented data formatted like e.g.: an e-mail message (e.g., an EML file), a Web page (e.g., an HTML file), an electronic text document (e.g., a Microsoft Word document, a PDF file, a PostScript file), an electronic presentation (e.g., a Microsoft PowerPoint document).
  • an e-mail message e.g., an EML file
  • a Web page e.g., an HTML file
  • an electronic text document e.g., a Microsoft Word document, a PDF file, a PostScript file
  • an electronic presentation e.g., a Microsoft PowerPoint document.
  • the present invention can be used for filtering digital content.
  • character A symbol in a writing system; examples are letters, ligatures, punctuation, digits, and various symbolic shapes with defined meaning.
  • cognitive model A computer program simulating certain cognitive methods (i.e., processes of the mind).
  • the methods are those of text perception and text reading.
  • the cognitive model's task is to define the glyphs' reading order.
  • compositional order The order in which a set of glyphs is reproduced.
  • confidence score A quality measure of a partition.
  • covertext The reproduced, perceivable version of a digital text source. It is derived from the plaintext by a reproduction process.
  • digital content A machine-readable representation of any kind of information.
  • digital text source A machine-readable representation of text- based information in any language or format, possibly augmented or linked with style (markup) information and/or multimedia objects (images, sound, video, etc.).
  • digital text sources are e-mail messages, MMS messages, Web pages, productivity software documents (DOC, PPT, PDF, PS).
  • display time In the context of a rendering process, the duration of time in which a rendered shape remains visible.
  • glyph The visual shape of a character.
  • hidden salting Salting which aims to generate a covertext that is intentionally different from the analysed plaintext.
  • the letter size, spacing, typeface, colouring and other characteristics of a plaintext can be manipulated and these manipulations are hidden in the covertext.
  • the reading order of the letters in the plaintext might completely diverge from the actual reading order in the covertext, resulting in a covertext with a completely different semantic meaning.
  • hypertext markup language The predominant markup language for Web pages, standardized by the W3C. It provides a means to describe the structure of text-based information in a document, to supplement that text with interactive forms, embedded images, and other objects. HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code which can affect the behavior of Web browsers and other HTML processors.
  • objective value A quality measure of a partitioning. In particular, a measure of how well the partitioning represents the text as observed (interpreted) through human perception.
  • OCR Short for Optical Character Recognition.
  • optical character recognition The mechanical or electronic translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text. Usually abbreviated to OCR.
  • output medium In the context of a rendering process, the target of the rendering operations, either physical (e.g., a monitor screen) or logical (e.g., an in-memory image buffer).
  • physical e.g., a monitor screen
  • logical e.g., an in-memory image buffer
  • page An area in the output medium which encloses all visible glyphs.
  • partition A closed and connected area that covers part of the page.
  • a partition is used to represent a text block.
  • partitioning A complete division of the page into one or more non- overlapping partitions.
  • the completeness property states that the page is fully covered by the union of all partitions.
  • plaintext The literal, original, machine-readable (raw) version of a digital text source. It is reproduced (e.g., made visual) by a reproduction process into a covertext.
  • reading order The order in which a set of glyphs is most likely read by humans in a particular language.
  • reading direction One of possible reading orders of the glyphs in a text block.
  • the reading direction of this paragraph can be expressed as: 'read glyphs left to right on descending, successive lines'.
  • rendering process Any computer process or method that performs rendering.
  • rendering system Any system that performs rendering.
  • salting Intentional addition or distortion of content patterns in a digital data source for reasons of obfuscation or evasion of certain methods of automated analysis or inspection, in particular that of content filtering.
  • Text A sequence or constellation of characters in a particular writing system, meant for human interpretation in a communicative act.
  • Text can be encoded into computer-readable formats (e.g., ASCII, Unicode, HTML). Text is usually distinguished from non-character encoded data, such as graphic images (e.g., encoded in the form of bitmaps).
  • text block A spatially and semantically coherent grouping of glyphs, such as a title, a paragraph, a column, etc.
  • text line A single span of characters that can be read by a
  • text production process Any process which makes use of a rendering system to produce or reproduce a visual representation of a digital text source.
  • Figure 1 is a flow diagram of the invention.
  • Figure 2 illustrates the concept of reading direction, and how the compositional glyph order is dubious.
  • the bordered frame outlines a page containing two text blocks (shown dashed), whose perceived contents arise from rendering a virtual grid of text glyphs (shown asides).
  • the grid traversal order during rendering can be manipulated, resulting in a compositional glyph order (dark arrows) that differs from the reading order (light arrows).
  • Figure 3 shows the page of an example phishing e-mail that is properly divided into partitions (columns) by the cognitive model, with correctly assigned reading directions (top-left arrow pairs). Partitions' confidence scores are indicated bottom-right.
  • Figure 4 illustrates the page partitioning search process.
  • Part of a fictitious search tree is shown, detailing the transition from depth 2 to depth 3.
  • In layer 2a, the partitions selected for refinement are shown hatched.
  • Layer 2b depicts the generated refined partitionings, crossing over those with no improvement in objective value (indicated in the bottom-right page corner; the values are exemplary).
  • the breadth of the search tree is restricted to a maximum of two candidate partitionings. This causes the elimination of the first retained rightmost candidate for depth 3. For clarity, reading directions are omitted from the figure.
  • Figure 5 shows a limited set of eight reading directions. Assuming a conventional writing system of ordered, parallel text lines, oriented along the page bounds (i.e., horizontal or vertical), the set contains all eight combinations of line orientation, line order, and glyph order along the lines. More specifically, every reading direction is referred to as A-B, where A indicates the order of the text lines, and B denotes the order of the glyphs along those lines. For A and B the following abbreviations are used: TD is top-down, BU is bottom-up, LR is left-right, and RL is right-left.
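Under the figure's A-B naming convention, the set of eight reading directions can be enumerated mechanically. A minimal sketch (the label strings follow the abbreviations above; for horizontal lines the line order is vertical and the glyph order horizontal, and vice versa for vertical lines):

```python
from itertools import product

# Horizontal text lines: line order is TD/BU, glyph order along the line is LR/RL.
HORIZONTAL = [f"{a}-{b}" for a, b in product(("TD", "BU"), ("LR", "RL"))]
# Vertical text lines: line order is LR/RL, glyph order along the line is TD/BU.
VERTICAL = [f"{a}-{b}" for a, b in product(("LR", "RL"), ("TD", "BU"))]

READING_DIRECTIONS = HORIZONTAL + VERTICAL
print(READING_DIRECTIONS)
# 'TD-LR' (top-down lines, left-right glyphs) is the conventional Western direction.
```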
  • Figure 6 shows a standard set of eight partitioning patterns.
  • Figure 7 tabulates the basic statistics of the examples data set.
  • Figure 8 shows the distribution of feature scores for the different reading directions or text line orientations applied to entire pages of the examples data set.
  • Figure 9 shows ROC curves of the reading direction assignment function F.
  • the curve labelled 'AN' denotes a macro-average over e-mail message categories.
  • Figures 10 - 13 tabulate the prevalence of detected hidden text salting tricks in subsets of the examples data set, by message category and trick type.
  • 'Glyph size' consolidates 'font size' and 'shape size' (cf. infra). The numbers represent messages per thousand. Aggregates are provided over message categories ('AN' messages) and trick types ('Any' tricks, with and without 'glyph order'). Three salting degrees g are differentiated.
  • the rendering analysis applies a set of general visibility conditions to the rendered, attributed text primitives (glyphs) to determine their visibility.
  • an artificially intelligent cognitive model of human text reading is used to compose the glyphs in logical text fragments (i.e., words in sentences in text blocks).
  • the cognitive model which we propose searches for an optimal partitioning of the two-dimensional glyph space, in which logical text blocks are identified and assigned the most likely text reading direction.
  • the invention enables the detection of general types of hidden salting tricks (e.g., as useful indicators for filtering) and makes available the actually perceived text contents, which can be used for further processing.
  • Figure 1 gives a schematic outline of the method of the invention.
  • A digital text source (in Figure 1, the 'message source'; e.g., a Web page) is visualized to an end user by means of a text production process.
  • This process is, for example, a Web browser.
  • the essential idea of the invention is to tap into the rendering process, analyse the rendering commands and attributes to detect anomalies (in particular, hidden content) as manifestations of salting tricks.
  • the intercepted, visible text characters (in Figure 1, 'attributed glyphs') are passed to a cognitive model for the reproduction of the perceived text.
  • the invention defines as input any process which makes use of a rendering system (described further) to produce or reproduce a visual representation of the digital text source.
  • This text production process could be anything from a simple text viewing application to a full-featured Web browser, graphical email client, GUI widget or window (Graphical User Interface), etc.
  • the process steers and controls the rendering through commands and directives that are understood by the rendering system.
  • a conventional rendering system is assumed, defining the operations for drawing primitive shapes such as lines, rectangles, polygons, images, etc., and text character sequences on an output medium.
  • command arguments may control position, size, and other primitive-specific properties.
  • a second set of commands sets or changes common rendering attributes (i.e., attributes used in many of the former drawing operations), including pen colour, pen thickness, pen stroke, background colour, typeface, letter size, etc.
  • an open-source rendering software library is used, providing software routines that carry out the primitive drawing operations. The implementation hooks into those software routines, extending their code to monitor and intercept the various drawing primitives. Additional details are set forth below.
  • the invention intercepts at the level of the rendering system all requests for drawing any of the text primitives. All text drawing operations require an argument that specifies the literal text to be drawn. Most text production processes use these operations for drawing text, since the operations are convenient and most efficient. In addition, any primitives which might conceal such rendered text (i.e., opaque, overlapping shapes) are also intercepted. This makes it possible to incrementally build an internal representation of the characters that appear on screen for the user to read.
  • the representation that is used is a list of attributed glyphs; positioned shapes of individual characters that are decorated with rendering attributes and any concealing shapes. The glyphs are listed in the compositional order; the order in which they are rendered.
  • Conditions are named for future reference and indicate the applicable glyph type.
  • the set is complete under the assumption of static content (i.e., content whose appearance does not change over time) and assuming simplified rendering conditions.
  • glyphs have a uniform, solid fill colour and a uniform background colour (i.e., not considering special compositing modes such as transparency).
  • an implementation can sample the background colour at the glyph's center.
  • overlapping shapes are opaque. All these assumptions can be relaxed or removed by more complex (and computationally expensive) variants of these visibility conditions. Conversely, knowledge of the text production process may justify the use of simplifying assumptions.
  • font size states that the glyph's font size (a rendering attribute) is sufficiently large.
  • shape size states that the glyph's shape (visual outline) is sufficiently large.
  • the condition applies to all non-whitespace glyphs; glyphs necessitating actual drawing to produce their visual representation.
  • font colour states that the glyph's fill colour contrasts well with the background colour. The condition applies to all non- whitespace glyphs.
  • glyph visibility condition concealment states that the glyph is not substantially concealed by other glyphs or overlapping shapes. The condition applies to all glyphs.
  • clipping states that the glyph is drawn mostly inside the drawing clip; a spatial mask that is applied during rendering, and that remains within the target device's physical bounds. The condition applies to all glyphs.
  • Failure to comply with any of the glyph visibility conditions results in an invisible glyph, which provides an indication of the presence of hidden salting tricks. For example, characters drawn in a zero-sized font will violate the 'font size' and 'shape size' conditions. The invisible ink trick is detected by the 'font colour' condition, and so on.
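The five visibility conditions can be sketched as per-glyph checks. This is a minimal illustration only: the `Glyph` fields, the numeric thresholds, and the luminance-based contrast measure are assumptions for the sketch, not values specified by the invention.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height

@dataclass
class Glyph:
    char: str
    font_size: float
    bbox: Rect
    fill: Tuple[int, int, int]        # RGB fill colour (rendering attribute)
    background: Tuple[int, int, int]  # background colour sampled at the glyph's centre
    concealed_fraction: float = 0.0   # fraction covered by later opaque shapes
    clipped_fraction: float = 0.0     # fraction falling outside the drawing clip

def contrast(c1, c2) -> float:
    # Crude luminance difference; a real system would use a perceptual metric.
    lum = lambda c: 0.299 * c[0] + 0.587 * c[1] + 0.114 * c[2]
    return abs(lum(c1) - lum(c2)) / 255.0

def violated_conditions(g: Glyph) -> List[str]:
    """Return the names of the visibility conditions the glyph fails."""
    failures = []
    if not g.char.isspace():                       # non-whitespace glyphs only
        if g.font_size < 4:                        # 'font size' condition
            failures.append("font size")
        if g.bbox[2] * g.bbox[3] < 4.0:            # 'shape size' condition
            failures.append("shape size")
        if contrast(g.fill, g.background) < 0.05:  # 'font colour': invisible ink
            failures.append("font colour")
    if g.concealed_fraction > 0.5:                 # 'concealment' condition
        failures.append("concealment")
    if g.clipped_fraction > 0.5:                   # 'clipping' condition
        failures.append("clipping")
    return failures

def is_visible(g: Glyph) -> bool:
    return not violated_conditions(g)

# A character drawn in a zero-sized font violates both size conditions:
hidden = Glyph("x", 0.0, (0, 0, 0, 0), (0, 0, 0), (255, 255, 255))
print(violated_conditions(hidden))  # -> ['font size', 'shape size']
```

Any non-empty failure list marks the glyph invisible and provides a per-trick indicator, as in the statistics discussed below.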
  • the invention offers as one result the detection of general types of hidden content (salting) tricks, which manifest themselves in all affected glyphs.
  • This result enables, for instance, the production of salting trick statistics that run over glyphs, possibly split out for the different trick types (i.e., visibility conditions).
  • the deceptive effects of all hidden text salting tricks, detected through the glyph visibility conditions, can be undone (resolved) by a two-step procedure.
  • the cognitive model is an essential component, as no relation between the compositional order and the reading order can be reliably assumed (adversaries are known to exploit the naive assumption, e.g., the slice-and-dice trick [6], cf. Figure 2). Many alternative implementations of the cognitive model seem possible. Below, we elaborate one particular implementation.
  • the reading order can be compared to the compositional order for the detection of order-related hidden salting tricks (alternatively, one could compare the corresponding texts using some distance metric, e.g., the edit distance).
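The text-comparison alternative mentioned above can be sketched with a standard Levenshtein edit distance between the compositionally ordered text and the reading-ordered text; the example strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# A large distance between the two texts signals order-related salting:
compositional = "gra Via"   # hypothetical order in which glyphs were rendered
perceived = "Viagra"        # order in which a human reads the visible glyphs
print(edit_distance(compositional, perceived))
```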
  • a glyph order trick can be defined as a visible glyph whose successor reading glyph differs from its successor compositional glyph. Since reading glyphs are always visible, it should be noted that the detection subsumes all cases in which the successor compositional glyph is invisible. Resolution of this trick class is achieved simply by assuming the glyphs' reading order for further processing of the digital text source.
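The glyph order trick definition above can be checked directly: for each visible glyph, compare its successor in reading order against its successor in compositional order. In this sketch glyphs are plain integer ids and the two orders are given as id sequences (an illustrative encoding, not the patent's internal representation).

```python
def glyph_order_tricks(compositional, reading):
    """Return the ids of visible glyphs whose successor in reading order differs
    from their successor in compositional (rendering) order. `reading` lists only
    visible glyphs, since reading glyphs are always visible; `compositional` may
    also contain invisible glyphs, so those cases are subsumed automatically."""
    comp_next = {g: n for g, n in zip(compositional, compositional[1:])}
    read_next = {g: n for g, n in zip(reading, reading[1:])}
    return [g for g in reading[:-1] if read_next[g] != comp_next.get(g)]

# Hypothetical example: glyph 2 is invisible and was rendered between 0 and 1.
compositional = [0, 2, 1, 3]   # rendering order, including invisible glyph 2
reading = [0, 1, 3]            # order in which a human reads the visible glyphs
print(glyph_order_tricks(compositional, reading))
# -> [0]: glyph 0's reading successor (1) differs from its rendered successor (2)
```

Resolution then simply amounts to emitting the glyphs in `reading` order for further processing.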
  • the initial state of the search algorithm considers the initial partitioning as sole candidate partitioning.
  • the initial partitioning defines a single partition that covers the entire page.
  • the reading direction of this partition is defined by a labelling function F, which assigns the most likely reading direction out of a set of candidate reading directions to its argument partition. F is described further.
  • a refinement of a partitioning can be generally defined as dividing (cutting) one of its partitions into multiple subpartitions. From the infinite number of possible refinements (spatial divisions of a partition's region), the implementation selects a finite, limited number of candidates by 1) enforcing two constraints on viable cuts, 2) defining an equivalence relation on cuts, and 3) sampling with a custom sample size. The first constraint forbids cuts from intersecting glyphs (discarding glyphs with no visual shape, such as those representing spaces).
  • Cuts satisfying this constraint we refer to as free cuts. This is a reasonable constraint, since glyphs cannot be shared among partitions (text blocks).
  • the other constraint restricts all cuts to be either horizontal or vertical, and spanning the entire partition region. Cuts satisfying both constraints, we refer to as free spanning cuts.
  • the second constraint is motivated by the rectangularity assumption of partitions (oriented along the page bounds) and by computational efficiency. Namely, the coordinate ranges of all horizontally and vertically free spanning cuts are defined by the gaps in the corresponding glyph shape projection profile. The profile is formed by taking the union of the vertical, respectively horizontal sides (as line intervals) of the rectangles circumscribing the glyph shapes. This operation runs linearly in the order of the number of glyphs.
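The coordinate ranges admitting free spanning cuts can be sketched as an interval-union-and-gaps computation over the glyph bounding boxes. The box encoding is an assumption of the sketch; the vertical case is symmetric.

```python
def cut_gaps(boxes, axis=0):
    """Gaps in the glyph projection profile along `axis` (0 = x, giving candidate
    vertical spanning cuts; 1 = y, giving horizontal ones). Each box is
    (x, y, w, h); returns a list of (start, end) gap ranges."""
    intervals = sorted((b[axis], b[axis] + b[axis + 2]) for b in boxes)
    merged = []
    for lo, hi in intervals:                     # union of projected intervals
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    # Gaps between consecutive merged intervals admit free spanning cuts.
    return [(a[1], b[0]) for a, b in zip(merged, merged[1:])]

# Two columns of glyphs with a gap between x=30 and x=40:
boxes = [(0, 0, 10, 10), (5, 12, 25, 10), (40, 0, 10, 10), (45, 12, 10, 10)]
print(cut_gaps(boxes))  # -> [(30, 40)]
```

Sorting dominates the cost; the merge pass itself is linear in the number of glyphs, matching the efficiency argument above.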
  • A standard set of partitioning patterns is depicted in Figure 6.
  • a custom, limited number of refinements are generated iteratively by using a sampling (stochastic) method. The particular sampling method, broken down into different steps, first selects a partition for refinement (inversely proportional to its confidence value, so as to focus on lesser confident partitions). Second, a single partitioning pattern is selected (at random). Third, the required free spanning cuts are sampled (where gap width or other layout properties influence sampling probabilities). The instantiated pattern produces one refinement. Duplicate refinements are ignored. Alternatively, deterministic selection and generation strategies may be employed. They enable reproducibility of results, at the cost of confined search space exploration.
  • If a candidate partitioning fails to generate better offspring, it remains in the set of candidate partitionings. Doing so preserves it as a viable solution to the page partitioning problem.
  • the set of candidate partitionings can be reduced to narrow down the search to, for instance, the best few candidates only, as indicated by O (cf. greedy, limited-width search strategy). This reduction could be motivated in practical settings to avoid a combinatorial explosion of candidates.
  • the search algorithm is terminated at the start of a new depth when one of several stopping criteria is fulfilled.
  • stopping criteria might limit the search depth, the total search duration, or may specify sufficient levels of optimality.
  • O pinpoints the optimal candidate partitioning Po within the set of candidate partitionings as a local or global optimum over the search space.
  • the glyphs' reading order readily follows from the reading direction labellings of the partitions in Po.
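The search over partitionings described above can be sketched as a greedy, limited-width (beam) search: keep at most W candidates per depth, expand each by refinements that improve the objective value, retain candidates with no better offspring, and stop on a criterion such as search depth. The `refine` and `objective` callables are placeholders for the refinement sampling and objective-value mechanisms described above.

```python
def beam_search(initial, refine, objective, width=2, max_depth=5):
    """Limited-width search. `refine(p)` yields candidate refinements of a
    partitioning p; `objective(p)` is its objective value (higher is better).
    Returns the best partitioning encountered (candidates must be hashable)."""
    candidates = [initial]
    best = initial
    for _ in range(max_depth):            # stopping criterion: maximum depth
        offspring = []
        for p in candidates:
            better = [q for q in refine(p) if objective(q) > objective(p)]
            # A candidate with no better offspring stays in as a viable solution.
            offspring.extend(better or [p])
        # Reduce to the best `width` candidates (greedy, limited-width step).
        candidates = sorted(set(offspring), key=objective, reverse=True)[:width]
        best = max([best, *candidates], key=objective)
    return best

# Toy check: "partitionings" encoded as integers, refinement adds 1 or 2,
# and the objective peaks at 7 (purely illustrative).
print(beam_search(0, lambda p: [p + 1, p + 2], lambda p: -abs(p - 7)))  # -> 7
```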
  • F is a stepwise filter.
  • candidates can be excluded in subsequent steps as more evidence is computed (cf. cascaded classification).
  • scores are computed for various layout and linguistic features, further detailed below. Every filter step computes the scores of all remaining candidates for a particular feature. Those candidates whose feature score is below some threshold from the maximum feature score are excluded. The use of thresholds relative to the maximum score ensures that at least one candidate survives our stepwise filter. Threshold values can be obtained by optimizing reading direction recognition accuracy on a training set. In the absence of further evidence, the final filter step makes a choice between the surviving reading direction candidates, either randomly or optimally.
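The stepwise (cascaded) filter can be sketched as follows. The feature scorers, threshold values, and final tie-break by total score are illustrative assumptions; only the structure (per-step exclusion relative to the maximum score, guaranteeing a survivor) follows the description above.

```python
def stepwise_filter(candidates, feature_scorers, thresholds):
    """Cascaded filter over reading-direction candidates. Each step keeps only
    candidates whose feature score is within `threshold` of the step's maximum,
    so at least one candidate always survives every step."""
    totals = {c: 0.0 for c in candidates}
    for scorer, threshold in zip(feature_scorers, thresholds):
        step = {c: scorer(c) for c in candidates}
        best = max(step.values())
        candidates = [c for c in candidates if step[c] >= best - threshold]
        for c in candidates:
            totals[c] += step[c]
    # Final step: choose among survivors (here: optimally, by accumulated score).
    return max(candidates, key=lambda c: totals[c])

# Hypothetical feature scores for four reading directions:
layout = {"TD-LR": 0.9, "TD-RL": 0.8, "BU-LR": 0.3, "BU-RL": 0.2}
linguistic = {"TD-LR": 0.95, "TD-RL": 0.1, "BU-LR": 0.5, "BU-RL": 0.4}
winner = stepwise_filter(["TD-LR", "TD-RL", "BU-LR", "BU-RL"],
                         [layout.get, linguistic.get], thresholds=[0.15, 0.15])
print(winner)  # -> TD-LR
```

The layout step here eliminates the two bottom-up candidates; the linguistic step then separates the surviving pair, mirroring the cascaded-classification behaviour.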
  • the confidence score C is computed.
  • C is defined as the product of those scores, with two modifications.
  • a penalty factor lowers the confidence whenever a substantial visual gap crosses all text lines oriented according to the chosen reading direction. These gaps indicate discontinuities in reading the partition's textual contents.
  • the effect and purpose of the total penalty is to promote the partition being reconsidered for refinement (cf. infra, refinement sampling method).
  • Rescaling provides the flexibility to more evenly spread the feature scores over their domain, with a clearer separation of scores relating to the proper reading direction. Also, the rescaling ensures safe computations on computer hardware using limited-precision floating point number representations, since the original scores might be close to 0, for example. [00103] In order to illustrate some of the discriminative reading direction features that can be used in F, we continue by describing the filter steps of the example implementation.
  • the layout feature exploits the property that whenever text is rendered in a variable-width font (i.e., a font where the advance or pixel width of a glyph depends on the character it represents), it is very likely that the same ordinal character position on different text lines corresponds to quite different pixel offsets, measured from the start of the line.
  • Fixed-width fonts result in glyphs being arranged in a regular grid of rows and columns. In either case, glyph widths do not affect the spacing between text lines.
  • measuring the alignment of glyphs both horizontally and vertically may provide a determinant cue in the detection of the text lines' orientation.
  • the layout feature uses a single-pass clustering algorithm that groups the glyphs of the partition in text lines, oriented according to the considered reading direction.
  • during clustering, a new text line is formed as soon as a glyph cannot be associated (position-wise) with any of the existing text lines. More precisely, we define the extent e(g) and baseline b(g) of a glyph g as the pixel height and the lower vertical pixel coordinate, respectively, in the case of horizontally oriented text lines, and as the pixel width and the lower horizontal pixel coordinate, respectively, in the case of vertically oriented text lines.
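By way of illustration only (the patent gives no code), the single-pass clustering into horizontally oriented text lines might be sketched as follows; the glyph attribute names and the baseline tolerance are assumptions:

```python
def cluster_text_lines(glyphs, tolerance=2):
    """Single-pass clustering of glyphs into horizontal text lines.

    Each glyph is a dict with pixel attributes "y" (top coordinate) and
    "h" (height); for horizontal lines the baseline b(g) is the lower
    vertical coordinate y + h. A glyph joins an existing line when its
    baseline lies within `tolerance` pixels of that line's baseline;
    otherwise a new line is started.
    """
    lines = []  # each line: {"baseline": ..., "glyphs": [...]}
    for g in glyphs:
        baseline = g["y"] + g["h"]
        for line in lines:
            if abs(line["baseline"] - baseline) <= tolerance:
                line["glyphs"].append(g)
                break
        else:
            lines.append({"baseline": baseline, "glyphs": [g]})
    return lines
```

The vertical case is symmetric, substituting "x" and "w" for "y" and "h".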
  • Figure 8(a) confirms that the measure is capable of reliably detecting the orientation of text lines. Used in the first filter step, this feature may already eliminate half of the candidate reading directions. [00106] As described above, the organisation of glyphs in text lines for the different candidate reading directions allows an early reconstruction of candidate perceived texts T restricted to the partition at hand. These texts are utilized in the following filter steps (feature implementations). [00107] In the second filter step, the word lengths feature exploits the property that reconstructing text along different text line orientations leads to significant statistical differences in the distribution of word lengths. We define a word as a sequence of glyphs on a single text line, delimited by whitespace: either whitespace glyphs or a visible gap on the page.
  • the likelihood of W can be measured using a probabilistic model of normal word lengths, e.g., one that is derived from a reference corpus C.
  • the following sampling strategy can be used.
  • a gram is defined here as the length of a single word wi.
  • the probability of a k-gram is defined as its relative frequency in C, mixed with a uniform background model to avoid zero probabilities (unobserved, yet possible k-grams). Averaging over the n subsequences and multiplying over the contained k-grams then produces the word lengths feature score, within [0, 1].
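A minimal Python sketch of such a word lengths model follows; it is illustrative only, and the reference corpus, the mixing weight, and the maximum word length are hypothetical choices not specified in the text:

```python
from collections import Counter

def word_length_score(word_lengths, corpus_lengths, k=1, lam=0.9, max_len=20):
    """Score a sequence of word lengths under a k-gram model derived from
    a reference corpus of word lengths, mixed with a uniform background
    model to avoid zero probabilities for unobserved k-grams."""
    grams = Counter(tuple(corpus_lengths[i:i + k])
                    for i in range(len(corpus_lengths) - k + 1))
    total = sum(grams.values())
    uniform = 1.0 / (max_len ** k)  # background over possible k-grams
    score = 1.0
    for i in range(len(word_lengths) - k + 1):
        g = tuple(word_lengths[i:i + k])
        # Mixture of corpus relative frequency and uniform background.
        score *= lam * (grams[g] / total) + (1 - lam) * uniform
    return score  # in (0, 1]
```

Word lengths reconstructed along the wrong text line orientation tend to be atypical, so their score falls well below that of the proper reading.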
  • the mutual information of the word lengths feature to the layout feature is maximal (close to 1) when variable-width fonts are used. In cases where fixed-width fonts neutralize layout analysis, word lengths may provide additional evidence. However, their discriminative power is less, as can be observed from Figure 8(c). [00109] So far, only the orientation of text lines could be discerned.
  • the characters feature aims to identify the direction in which text lines are to be read.
  • the characters feature score lies within [0, 1]. From Figure 8(d), it is clear that the characters feature is capable of reliably identifying the true text line orientation and reading direction.
  • the common words feature performs a dictionary lookup of all the words in T.
  • the dictionary D can be a listing of the most frequent words in a reference corpus.
  • the common words feature score is defined as the relative number of words in T that are present in D, and lies within [0, 1].
  • this feature was found to have the greatest discriminative power among all text line orientations and reading directions. This can be seen from Figure 8(b). However, since a lookup operation is performed for every word in T, this feature is computationally more expensive, and is therefore considered at the filter's end.
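The dictionary lookup reduces to a short routine; the following Python sketch is illustrative, with a hypothetical dictionary of frequent words:

```python
def common_words_score(text_words, dictionary):
    """Relative number of words of the candidate perceived text that are
    present in a dictionary of the most frequent words of a reference
    corpus; the result lies in [0, 1]."""
    if not text_words:
        return 0.0
    hits = sum(1 for w in text_words if w.lower() in dictionary)
    return hits / len(text_words)

# Hypothetical dictionary and candidate perceived text:
score = common_words_score(["The", "cat", "xqzt", "sat"],
                           {"the", "cat", "sat"})
```

Text reconstructed under the wrong reading direction yields mostly garbled tokens, so its score drops sharply relative to the proper reading.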
  • the features which are used in the determination of the reading order are generic, and - although illustrated for the English language in figures 8 and 9 - they can be used for detecting the reading order for any language, including segmented (where word tokens are delimited by white space) and unsegmented languages (e.g., Asian languages such as Chinese).
  • word lengths are recognized by using a dictionary of words or a language model in the considered language (manually or automatically acquired), and in case certain glyph sequences are not found in the dictionary, a default, close-to-zero probability for each glyph that is not part of a recognized word can be used.
  • the word length feature score can be averaged over the different possible readings.
  • new features for determining the reading order can be added, or some of the features described above can be deleted.
  • One obvious use of the invention is the filtering of digital text sources, based on the detection of hidden salting tricks and/or the covertext.
  • potential content filtering applications we mention the detection of Web spam [7], Web page cloaking [19], spoofing on the Internet [3], masqueraded data transfers on peer-to-peer networks, copyright infringements, unsolicited popups, spam and advertisements [8, 12], electronic greeting cards, IM (Instant Messaging) and MMS (Multimedia Messaging Service) communications on wired and mobile networks, offensive contents (e.g., pornography, scams, ideological rhetoric), malware spread through technical subterfuge, and many more [14].
  • One other use of the invention is to make content available in electronic form to allow for preservation, widespread availability and use, ease of reproduction, facilitating retrieval, search, mining, etc.
  • content is available only in raw, physical form, often on handwritten or printed paper (e.g., historical texts, manuscripts, library reference cards, forms, checks, postal mail pieces).
  • Transforming this content to digital form requires sophisticated scanning devices and automatic text reading systems, which are able to cast the scanned input images in sensible spatial compositions, and interpret identified textual or symbolic content under proper reading directions. This latter, complicated problem could be addressed by the cognitive model of the invention.
  • One other use of the invention lies in leveraging digital content accessibility tools for visually impaired people [13].
  • these tools suffer from heterogeneous data sources with an inconsistent use of accessibility directives or descriptors, which mark up the content.
  • the cognitive model could assist (either proactively or on the user's request) by pointing out the layout of the page, and by revealing the perceivable, textual contents for any of the identified text regions. This approach enables a guaranteed basic interpretation and access to any content, regardless of medium, type, or accessibility annotation level.
  • the invention can be used in association with OCR in multiple ways.
  • One use addresses the adversary technique of embedding sensitive textual content in graphic images, which are drawn pixel-by-pixel on screen using image drawing operations (cf. image spam [20, 9]).
  • Direct interception of the textual content from the image is not possible.
  • the plaintext generated by the invention will be incomplete, since it represents only part of the perceived text.
  • the images can be intercepted from the image drawing operations, preprocessed, and scanned by means of OCR (Optical Character Recognition) tools, and any recognized, positioned characters can be added to the list of attributed glyphs.
  • the unaltered cognitive model will then be able to pick up these characters, and restore their reading order amidst and in relation to otherwise intercepted characters.
  • Another use of OCR regards the detection of new hidden text salting tricks, undetectable by the current invention. The gist of the technique would be to compare the perceived text derived by the invention (cognitive model) with an OCR- extracted text from an image picturing a visual reproduction of the digital text source.
  • Marked differences in textual content could signal new hidden text salting tricks (if not errors of either method), and indicate vulnerabilities of the current invention (cognitive model).
  • Manual analysis can be used to reveal the new trick, and the invention can be leveraged by incorporating the new trick, e.g., as a new visibility condition.
  • the use of the invention is not restricted to static content (i.e., content whose appearance does not change over time).
  • the invention supports dynamic content by incorporating a time dimension.
  • every glyph can be annotated with its display time (start and end time of appearance), and all its attributes (including colour, size, position) can be tracked over time. This requires a continuous monitoring of a temporal rendering process.
  • the continuous updating of the internal representation of attributed glyphs triggers a recurring application of the cognitive model to produce updated versions of the perceived text (as a function of time).
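Tracking display times over the annotated glyphs enables simple temporal detection rules. The following Python sketch is illustrative only; the perceptibility threshold is a hypothetical value, since the patent defers the exact limit to empirical studies:

```python
MIN_PERCEPTIBLE_MS = 100  # hypothetical threshold (milliseconds); the
                          # text leaves the value to empirical studies

def flag_short_display(glyphs):
    """Return the glyphs whose display interval is too short for a human
    reader to perceive -- a cue for a dynamic-content salting trick.

    Each glyph is annotated with "start_ms" and "end_ms", the start and
    end time of its appearance on the output medium.
    """
    return [g for g in glyphs
            if g["end_ms"] - g["start_ms"] < MIN_PERCEPTIBLE_MS]

# A glyph flashed for 50 ms is flagged; one shown for 5 s is not.
flagged = flag_short_display([{"start_ms": 0, "end_ms": 50},
                              {"start_ms": 0, "end_ms": 5000}])
```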
  • Detection rules that signal dynamic content tricks can also be defined, for instance, identifying text that cannot be perceived because its display time is too short to be observed by humans.

Examples
  • a data set is used containing 252.515 e-mail messages from the time period 2000 - 2006, manually labelled into the categories spam (unsolicited mail), phishing (legitimate-looking, yet fraudulent mail [3]), and ham (legitimate mail).
  • a second classification is provided, differentiating e-mails that contain HTML content from those not containing any HTML (non-HTML). Because HTML is a richer text format, we expect to find more hidden salting in HTML e-mail.
  • the basic statistics are given in Figure 7.
  • the data set comprises the 92.189 messages from the 2001-2002 TREC public e-mail corpus [18], augmented with various private e-mail feeds.
  • FIG. 8 shows the distribution of feature scores that are used in the classification of a text block's reading direction. The scores are computed for the entire pages of the e-mail data set. The resulting histograms illustrate the potential discriminative power of the different features.
  • Figure 8(a) shows that the distribution of the 'layout' feature scores when measured in the wrong text line orientation (i.e., vertical instead of horizontal) is markedly different from the correct horizontal orientation.
  • Figure 8(b) shows a significant shift in distribution of the 'word lengths' feature, when measured on horizontal rather than vertical text lines.
  • Figure 8(d) demonstrates that character n-grams (the 'characters' feature) can be used to reliably distinguish the correct reading direction. Note that the value scale of Figures 8(c) and 8(d) is base-10 logarithmic, as the raw feature scores are close to 0.
  • Figure 9 shows ROC (Receiver Operating Characteristic) curves of the reading direction assignment function F at different steps of F and for the different message categories.
  • the curve labelled 'All' represents a macro-average over message categories. Recall and precision are defined as follows. Recall at step i of F is defined as the relative number of pages (likewise, messages) for which the set of candidate reading directions still contains the correct reading direction (i.e., TD-LR) at the end of step i. Precision at step i is the relative number of correct reading directions in the set of candidate reading directions at the end of step i, averaged over all pages. From Figure 9(a), it is clear that the 'layout' feature may reliably exclude up to half of the candidate reading directions (improving precision from 1/4 to 1/2). There is a correlation between filtering effectiveness at this step and the proportion of HTML messages in every message category. In particular, almost all phishing messages in our data set contain HTML content (cf. Figure 7), which is rendered in variable-width fonts, making the layout feature most effective.
  • the message categories have different recall offsets (at 80% precision) as a result of the previous filtering steps (including step 1, cf. Figure 8(a)).
  • By lowering the filter threshold in the final step of F (which considers the 'common words' feature), the precision of ham and phishing can be increased to 98% and higher without loss of recall. This can be explained by the verbosity of these legitimate(-looking) messages, in which common words abound. Not only is spam less verbose, its words are often missing from our dictionary (e.g., product names and randomized words), so its curve declines earlier.
  • Figures 10 - 12 show how much hidden text salting is detected in the data set for the different message categories, trick types, and for three salting degrees g: the percentage of glyphs in the analysed page that are affected by any of the detected tricks. Because the TREC corpus is used in many spam filtering studies, Figure 13 reports on that corpus only. In the tables, 'Clip' stands for the 'clipping' trick, 'Conceal' for the 'concealment' trick, 'Font colour' for the 'font colour' trick, 'Glyph size' consolidates the 'font size' and 'shape size' tricks (cf.
  • 'Glyph order' relates to a difference in reading order from compositional order (cf. cognitive processing).
  • the numbers represent messages per thousand.
  • Aggregates are provided over message categories ('All' messages) and trick types ('Any' tricks, with and without 'glyph order').
  • Statistics on the 'glyph order' trick are indicative only. Its detection precision is expected to be lower due to the incomplete heuristic search, the imperfect or ambiguous page partitioning, and reading direction misassignments.
  • the detection precision of the other trick types may be less than perfect due to rare rendering artefacts (e.g., improper layout or formatting incompatibilities of the text production process) and the somewhat arbitrary thresholding of visibility conditions (cf. rendering analysis).

Abstract

Described is a technology by which hidden text salting is detected and resolved in digital text sources by analysing the reproduction of the text source and, subsequently, by cognitively processing the retained visible content to derive the perceived text. The rendering analysis applies a set of general visibility conditions to the rendered, attributed text primitives to determine their visibility. Retaining only the visible glyphs, an artificially intelligent cognitive model of human text reading is used to compose the glyphs in logical text fragments. The cognitive model which we propose searches for an optimal partitioning of the two-dimensional glyph space, in which logical text blocks are identified and assigned the most likely text reading direction. As a result, the invention enables the detection of general types of hidden salting tricks and makes available the actually perceived text contents, which can be used for further processing.

Description

METHOD FOR DETECTING AND RESOLVING HIDDEN TEXT SALTING
RELATED APPLICATIONS
[0001] The present application claims priority to United Kingdom Patent Application Serial No. GB0719964.9, filed on October 12, 2007. The contents of this application are incorporated herein by reference in their entirety. FIELD OF INVENTION
[0002] The invention relates to the inspection, filtering, and processing of textual data in an adversary environment. Described is a technology by which hidden text salting (i.e., distorted or hidden textual content patterns) is detected and resolved in digital text sources by analyzing the reproduction (graphical rendering) of the text source and, subsequently, by cognitively processing the retained visible content to derive the perceived text. The rendering analysis applies a set of general visibility conditions to the rendered, attributed text primitives (glyphs) to determine their visibility. Retaining only the visible glyphs, an artificially intelligent cognitive model of human text reading is used to compose the glyphs in logical text fragments (i.e., words in sentences in text blocks). The cognitive model which we propose searches for an optimal partitioning of the two-dimensional glyph space, in which logical text blocks are identified and assigned the most likely text reading direction. As a result, the invention enables the detection of general types of hidden salting tricks (e.g., as useful indicators for filtering) and makes available the actually perceived text contents, which can be used for further processing. BACKGROUND ART
[0003] Salting is the intentional addition or distortion of content patterns in a digital data source for reasons of obfuscation or evasion of certain methods of automated analysis or inspection, in particular that of content filtering. One commonly differentiates hidden salting (e.g., text displayed with invisible ink) from surface salting (e.g., images containing random, anomalous pixel dots), depending on whether the effects of salting are visually perceivable. The invention targets hidden salting of textual data. Hidden salting is the most dangerous in the context of fraudulent schemes since naive users are easily misled. Hidden salting is, for instance, found in phishing e-mails that aim at stealing personal information, which can be used to commit identity theft. Salting in digital content is a phenomenon that only recently has drawn some scientific attention. However, given the increasing usage and importance of automated content filtering, e.g., on interconnected networks including the Internet, it can be expected that salting will become more widespread and more sophisticated. In the past few years, researchers have primarily focused on email spam, as it constitutes one of the best known digital sources of salting. We wish to note, however, that salting can be applied to any medium (e.g., e-mail, Web pages, electronic documents, MMS mobile messages, MP3 music files) and content type (e.g., text, picture, audio elements in these media). [0004] To the best of our knowledge, the remainder of this section lists the main technologies from literature that are relevant or related to the invention, both historical and state-of-the-art.
[0005] Traditionally, content filters are ignorant of the salting phenomenon and perform a shallow processing of the plaintext. They assume that structural and markup characteristics can be safely (and efficiently) discarded in the derivation of a plaintext that is deemed informationally equivalent to the covertext. However, it is important that content filters can distinguish content that is apparent to the end user from hidden content. The presence of hidden salting can signal a fraudulent scheme, and in addition, content filters can only operate accurately if the right content attributes are used in the filtering. In particular, the distortion of learned or hard-coded content patterns by hidden and/or surface salting patterns may confuse automatic filters. This phenomenon is illustrated by the recent surge of image spam (i.e., graphic-only spam messages), in which the images contain sensitive textual content, and are randomized by image surface salting techniques [20, 4, 9]. [0006] Traditionally, attempts to overcome salting resorted to a fixed set of human-coded, ad-hoc salting trick detection scripts. These software scripts target very specific, implementation-dependent instances of known (i.e., previously observed) salting tricks. The approach of crafting a set of salting detection rules provided for a natural extension to traditional, heuristic-based filters, which are based on manually maintained sets of detection rules. Acting as heuristics, the rules were weighted first by expert judgment, and later optimized by relaxation-based systems such as Bayesian networks (e.g., through supervised learning methods). A popular example of a (largely) heuristic-based filter is SpamAssassin [17]. Despite their initial success, heuristic-based filters proved to be an easy target for probing spammers, trying to get round the fixed set of built-in rules. In addition, the filters proved hard to maintain.
[0007] Presently, new attempts to overcome salting focus on the engineering of features other than plaintext tokens which are less sensitive to salting. However, these methods are prone to circumvention, and do not directly detect or resolve the presence and effects of salting, a capability that would be greatly appreciated. A recent line of work is the extraction of visual features from graphical content [20, 9], considering colour distributions, texture, detection of the presence of embedded text, detection of anomalous dots that don't fit the smoother gradients of light found in images of legitimate e-mail, etc. A recent study of Wu et al. [20] based on image properties and text detection in images indicates a 37% improvement in spam detection rate over traditional text-based spam filters, but does not study the presence or effects of image salting. Fumera et al. [4] recognize embedded text in images using OCR tools. The additionally extracted tokens result in a reduction of spam misclassification up to around a half when using state-of-the-art, tokenization-based classifiers. As a general critique however, it is not determined or estimated how much salting is present in the data sets used, and how results degrade with an increase in noise, of either legitimate or adversarial nature.
[0008] In recent work, Bratko et al. [1] apply adaptive statistical data compression models to raw e-mail data (as binary or character sequences), making sensitive preprocessing steps unnecessary. Adaptively building up a separate model for classified ham and spam e-mail, the classification outcome of a new target e-mail is determined by the model that yields the best compression rate on the target e-mail. As measures of compression, the authors use the cross-entropy and the description length. They also evaluate the effects of visible noise by randomly substituting characters. Even after 20% of all characters are distorted, rendering messages practically illegible, they retain a respectable performance. However, the randomness assumption is inappropriate for modelling real-world surface salting and clearly does not address hidden salting patterns. [0009] Fuzzy signature anti-spam methods [11, 5] apply hashing techniques to e-mail messages for the purpose of mass mailer detection. The hashes, alternatively termed checksums, signatures or digests, are computed at the different client sites and collected in a redundant network of secure servers. Any client may query these servers for counts on similar submitted messages (similar by matching checksum). Hence, a collective memory of bulk (spam) e-mail is formed, adding collaborative filtering capability to local, on-site filtering. Two popular distributed signature systems are DCC [16] and Razor [15]. In these distributed checksum systems, robustness is attributed to secret, non-trivial, input-insensitive checksum computation schemes, producing values that are constant across common variations in bulk messages, e.g., personalizations and tracking numbers added by spammers. The checksum computation schemes are handcrafted and changed by the system developers as spam evolves. However, fuzzy checksum schemes pose several disadvantages.
They are challenged by probing spammers (who equally have access to checksum servers) and the (distributed) computing power they wield to sufficiently randomize every message sent out. Moreover, the aim of using checksums is not to detect the presence or structure of salting. Regarding salting resolution, the checksums provide no useful basis for further content analysis (i.e., do not provide a noise-free plaintext). Lastly, distributed signature systems rely on an active community of contributing email users, and their 'live' nature makes them unsuitable for evaluation using established data sets.
[0010] A component of the technology presented is the cognitive model, which is a particular instance of an automatic text reading system, studied by the Document Image Understanding (DIU) research discipline. In its general acceptation, DIU is the process that transforms the informative content of a document from paper into an electronic format outlining its logical content. Its objectives and scope are broader than those of Optical Character Recognition (OCR) research, which focuses on the recognition of isolated, individual characters. In general, DIU faces the more difficult problem of capturing structural content from noisy, degraded, skewed, limited-resolution input images, such as scanned images of carbon copy documents, low-resolution faxed documents, and n-th generation photocopies. For this reason, DIU mainly targets confined, controlled domains. The cognitive model of the invention differs from related DIU (OCR) methods in that it efficiently operates on correctly identified glyphs (i.e., intercepted from rendering commands rather than extracted by OCR from scanned images), and anticipates the use of unconventional (misleading) reading directions, which may still be correctly interpreted by humans' perceptive abilities (see Figure 2). [0011] The cognitive model presented is related to segmentation and content recognition in computer vision. Image segmentation is a fundamental component in many computer vision applications and is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest [10]. The major difference of the invention is that it operates on glyphs resulting from the analysis of a rendering process, rather than the pixels of an image.
In addition, pieces of text are segmented that have the same property of interest, namely the reading direction, and that are segregated from each other based on layout properties, in particular visual boundaries. In image segmentation, common properties of interest are brightness, colour, texture, motion, and pixel location proximity.
[0012] The major drawback of the above technologies is that they provide no general solution to the problem of hidden salting in digital data (text) sources. Some technologies provide specific solutions targeted at specific salting tricks, making them harder to maintain and easier to circumvent. Other technologies focus on the engineering of other features, under the assumption that they are less sensitive to salting. However, a solution for the direct detection and resolution of any (new) instance of salting trick is greatly appreciated, since it fulfills an essential requirement for robust (fail-safe), accurate, and endurable processing of data in general, and filtering of data in particular.
[0013] Of all salting forms, a solution that detects and resolves hidden salting (compared to surface salting) is most appreciated as this form of salting is the most dangerous in the context of fraudulent schemes since naive users are easily misled. The invention aims to provide such a solution. [0014] Furthermore, the invention makes it possible to analyze data sets on the presence of hidden salting, providing the statistics which are currently lacking from any study. SUMMARY OF THE INVENTION
[0015] It is the aim of the present invention to provide a method which makes it possible to solve the problems of the state of the art and to detect hidden or distorted content in a digital text source that is not perceived by the human end user when the plaintext of said digital text source is rendered through text production processes on output signals. The method of the present invention comprises the steps of: a) building an internal representation of the characters that appear on the output medium for the user to read, b) classifying each of the characters in said internal representation as visible or invisible by evaluating one or more visibility conditions on general visual properties of the character shapes (glyphs), c) indicating that the digital text source contains hidden content when one or more of the characters is classified as invisible. [0016] In a preferred embodiment of the present invention the visibility conditions comprise the font size of a glyph, the shape size of a glyph, and the contrast of a glyph's color with the background color. The character can be marked as invisible when its glyph is substantially concealed by other shapes, including other glyphs, or when its glyph is substantially clipped. Furthermore, the character can be marked as invisible when the display time of its glyph is too short to be observed or otherwise interpreted within context by humans (the display time determined e.g., by empirical studies). In another preferred embodiment of the present invention the method further comprises a) constructing a page containing only characters classified as visible, b) determining the reading order in which the visible characters are most likely read by humans, c) comparing the reading order obtained in (b) with the rendering order of the characters, d) indicating that the digital text source contains distorted content when the reading order differs from the rendering order.
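As a non-normative illustration of steps (a)-(c), a few of the visibility conditions named above might be evaluated as follows in Python; the glyph attribute names and the threshold values are assumptions for the sketch, not values taken from the patent:

```python
def is_visible(glyph, page_bg):
    """Evaluate illustrative visibility conditions on a rendered glyph.

    glyph: dict with "font_size_px", pixel shape size "w"/"h", and an
    RGB "colour" tuple; page_bg: RGB background colour.
    Thresholds below are hypothetical.
    """
    if glyph["font_size_px"] < 2:         # font size too small to read
        return False
    if glyph["w"] < 1 or glyph["h"] < 1:  # degenerate shape size
        return False
    # Contrast of the glyph's colour with the background colour,
    # measured here as a simple per-channel RGB distance.
    contrast = sum(abs(a - b) for a, b in zip(glyph["colour"], page_bg))
    if contrast < 30:                     # e.g., white text on white
        return False
    return True

def contains_hidden_content(glyphs, page_bg):
    """The digital text source is flagged as containing hidden content
    when one or more glyphs is classified as invisible."""
    return any(not is_visible(g, page_bg) for g in glyphs)
```

Conditions such as concealment, clipping, and display time would be further predicates of the same form, evaluated on the attributed glyphs intercepted from the rendering process.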
[0017] In another preferred embodiment the reading order in which the visible characters are most likely read by humans is determined by a cognitive model.
[0018] In another embodiment of the present invention the method further comprises determining the perceived content of the digital text source by ordering the visible characters according to the determined reading order.
[0019] In one embodiment of the present invention the determining of the reading order comprises the steps of: a) creating a set of candidate reading orders, b) calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of one or more discriminative features, c) selecting the reading order from said set of candidate reading orders by selecting the reading order with the highest probability.
[0020] In yet another embodiment of the present invention determining the reading order comprises the steps of: a) creating a set of candidate reading orders, b) creating a set of discriminative features, c) calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of a discriminative feature out of the set of discriminative features, d) comparing the computed probabilities of (c) to remove unlikely reading order candidates from the set of candidate reading orders, e) repeating steps (c) and (d) for the remaining candidate reading orders whereby another discriminative feature is considered until all discriminative features out of the set of discriminative features are considered, f) selecting the reading order from the remaining set of candidate reading orders by taking all probabilities related to the different discriminative features into account.
[0021] In an embodiment of the present invention the page of visible characters is first partitioned into different text blocks and the reading order is determined for each text block. The digital text source is indicated as containing distorted content when one of the text blocks has a reading order which differs from the rendering order of the characters within that text block.
[0022] In a preferred embodiment of the present invention the partitioning of a page of visible characters into different text blocks comprises the steps of:
1. calculating an objective value of the page,
2. creating a set of candidate partitionings of said page,
3. calculating for each partitioning of the page obtained in (2) an objective value,
4. selecting from the set of candidate partitionings of the page obtained in (2) and the page, the partitioning or page with the highest objective value.
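The four steps above can be sketched as scoring the unpartitioned page alongside every candidate partitioning and keeping the highest-scoring option. The homogeneity objective and the lexicon below are toy stand-ins for the confidence-based objective described later:

```python
def select_partitioning(page, candidates, objective):
    """Steps 1-4 of [0022]: score the unpartitioned page and every
    candidate partitioning with an objective function, keep the best."""
    options = [[page]] + list(candidates)
    return max(options, key=objective)

# Toy objective: a partitioning scores well when each block is homogeneous,
# i.e. entirely dictionary words or entirely non-words. Illustrative only.
LEXICON = {"hello", "world"}

def homogeneity(partitioning):
    def block_score(block):
        known = sum(w in LEXICON for w in block) / len(block)
        return 1.0 if known in (0.0, 1.0) else known
    return sum(map(block_score, partitioning)) / len(partitioning)

page = ["hello", "world", "xqzt"]
split = [["hello", "world"], ["xqzt"]]
print(select_partitioning(page, [split], homogeneity))
```

Here the split into a dictionary-word block and a noise block beats leaving the page whole, mirroring how the objective value guides the choice.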
[0023] Hereby, calculating an objective value of a page or partitioning of a page can be based on the confidence scores of its member text blocks. In an embodiment of the present invention creating a set of candidate partitionings is based on visual properties of the page, including but not limited to text layout and spacing.
[0024] In another embodiment of the present invention the text blocks with associated reading order are composed into an overall perceived text. The digital text source can be any text-oriented data formatted like e.g.: an e-mail message (e.g., an EML file), a Web page (e.g., an HTML file), an electronic text document (e.g., a Microsoft Word document, a PDF file, a PostScript file), an electronic presentation (e.g., a Microsoft PowerPoint document).
[0025] The present invention can be used for filtering digital content.
Other uses include but are not limited to: making content available in electronic form, leveraging digital content accessibility tools, comparison to other content reconstruction or extraction methods (such as OCR), search, ranking, mining and classification.

DETAILED DESCRIPTION OF INVENTION

Definitions

[0026] Term definitions are listed below in alphabetical order.
[0027] character A symbol in a writing system. Examples are letters, ligatures, punctuation, digits and various symbolic shapes with defined meaning.
[0028] clipping In the context of a graphical rendering system, the application of a regional mask that is defined in the coordinate input space and which controls the regions in the output space that are not affected by any of the rendering operations.
[0029] cognitive model A computer program simulating certain cognitive methods (i.e., processes of the mind). In the context of the invention, the methods are those of text perception and text reading. In particular, given a set of visible, positioned and attributed glyphs, the cognitive model's task is to define the glyphs' reading order.
[0030] compositional order The order in which a set of glyphs is reproduced.
[0031] concealment In the context of a graphical rendering system, the event whereby the affected region (in the coordinate output space) of one rendering operation overlaps the affected region of a former rendering operation.
[0032] confidence score A quality measure of a partition. In particular, a measure of the joint belief in the logical relatedness of the glyphs contained in the partition and the assignment of its assumed reading order.
[0033] covertext The reproduced, perceivable version of a digital text source. It is derived from the plaintext by a reproduction process.
[0034] digital content A machine-readable representation of any kind of information.
[0035] digital data source Same as digital content.
[0036] digital text source A machine-readable representation of text- based information in any language or format, possibly augmented or linked with style (markup) information and/or multimedia objects (images, sound, video, etc.). Examples of digital text sources are e-mail messages, MMS messages, Web pages, productivity software documents (DOC, PPT, PDF, PS).
[0037] display time In the context of a rendering process, the duration of time in which a rendered shape remains visible.
[0038] glyph The visual shape of a character. [0039] hidden salting Salting which aims to generate a covertext that is intentionally different from the analysed plaintext. For example, the letter size, spacing, typeface, colouring and other characteristics of a plaintext can be manipulated and these manipulations are hidden in the covertext. As another example, the reading order of the letters in the plaintext might completely diverge from the actual reading order in the covertext, resulting in a covertext with a completely different semantic meaning.
[0040] HTML Short for Hypertext Markup Language.
[0041] hypertext markup language The predominant markup language for Web pages, standardized by the W3C. It provides a means to describe the structure of text-based information in a document, to supplement that text with interactive forms, embedded images, and other objects. HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code which can affect the behavior of Web browsers and other HTML processors.
[0042] objective value A quality measure of a partitioning. In particular, a measure of how well the partitioning represents the text as observed (interpreted) through human perception. [0043] OCR Short for Optical Character Recognition.
[0044] optical character recognition The mechanical or electronic translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text. Usually abbreviated to OCR.
[0045] output medium In the context of a rendering process, the target of the rendering operations, either physical (e.g., a monitor screen) or logical (e.g., an in-memory image buffer).
[0046] page An area in the output medium which encloses all visible glyphs.
[0047] partition A closed and connected area that covers part of the page. A partition is used to represent a text block.
[0048] partitioning A complete division of the page into one or more non-overlapping partitions. The completeness property states that the page is fully covered by the union of all partitions.
[0049] plaintext The literal, original, machine-readable (raw) version of a digital text source. It is reproduced (e.g., made visual) by a reproduction process into a covertext.
[0050] reading order The order in which a set of glyphs is most likely read by humans in a particular language.
[0051] reading direction One of possible reading orders of the glyphs in a text block. For example, the reading direction of this paragraph can be expressed as: 'read glyphs left to right on descending, successive lines'.
[0052] rendering The reproduction of digital content into visual output signals.
[0053] rendering order Same as compositional order, with reproduction meaning rendering.
[0054] rendering process Any computer process or method that performs rendering.
[0055] rendering system Any system that performs rendering. [0056] salting Intentional addition or distortion of content patterns in a digital data source for reasons of obfuscation or evasion of certain methods of automated analysis or inspection, in particular that of content filtering.
[0057] substantial clipping Clipping whereby the clipped-off region affected by the rendering operation exceeds some threshold fraction of the total region affected by the rendering operation (i.e., disregarding any clipping).
[0058] substantial concealment Concealment whereby the overlapped region of a former rendering operation exceeds some threshold fraction of the total region of that rendering operation. [0059] surface salting Salting which reveals itself as perceptible noise that is easily corrected and suppressed by the human mind or senses. Examples are the well-known "v 1 a g r a" character plays in e-mail, or anomalous dots appearing in spam e-mail pictures.
[0060] text A sequence or constellation of characters in a particular writing system, meant for human interpretation in a communicative act. Text can be encoded into computer-readable formats (e.g., ASCII, Unicode, HTML). Text is usually distinguished from non-character encoded data, such as graphic images (e.g., encoded in the form of bitmaps).
[0061] text block A spatially and semantically coherent grouping of glyphs, such as a title, a paragraph, a column, etc.
[0062] text line A single span of characters that can be read by a (semi)unidirectional eye movement.
[0063] text production process Any process which makes use of a rendering system to produce or reproduce a visual representation of a digital text source.
BRIEF DESCRIPTION OF THE DRAWINGS
[0064] Figure 1 is a flow diagram of the invention.
[0065] Figure 2 illustrates the concept of reading direction, and how the compositional glyph order can be deceptive. The bordered frame outlines a page containing two text blocks (shown dashed), whose perceived contents arise from rendering a virtual grid of text glyphs (shown alongside). The grid traversal order during rendering can be manipulated, resulting in a compositional glyph order (dark arrows) that differs from the reading order (light arrows).
[0066] Figure 3 shows the page of an example phishing e-mail that is properly divided into partitions (columns) by the cognitive model, with correctly assigned reading directions (top-left arrow pairs). Partitions' confidence scores are indicated bottom-right.
[0067] Figure 4 illustrates the page partitioning search process. Part of a fictive search tree is shown, detailing the transition from depth 2 to depth 3. In layer 2a, the partitions selected for refinement are shown hatched. Layer 2b depicts the generated refined partitionings, crossing out those with no improvement in objective value (indicated in the bottom-right page corner; the values are exemplary). In the example, the breadth of the search tree is restricted to a maximum of two candidate partitionings. This causes the elimination of the first retained rightmost candidate for depth 3. For clarity, reading directions are omitted from the figure.
[0068] Figure 5 shows a limited set of eight reading directions. Assuming a conventional writing system of ordered, parallel text lines, oriented along the page bounds (i.e., horizontal or vertical), the set contains all eight combinations of line orientation, line order, and glyph order along the lines. More specifically, every reading direction is referred to as A-B, where A indicates the order of the text lines, and B denotes the order of the glyphs along those lines. For A and B the following abbreviations are used: TD is top-down, BU is bottom-up, LR is left-right, and RL is right-left.
[0069] Figure 6 shows a standard set of eight partitioning patterns.
[0070] Figure 7 tabulates the basic statistics of the examples data set.
[0071] Figure 8 shows the distribution of feature scores for the different reading directions or text line orientations applied to entire pages of the examples data set.
[0072] Figure 9 shows ROC curves of the reading direction assignment function F. The curve labelled 'AN' denotes a macro-average over e-mail message categories.
[0073] Figures 10 - 13 tabulate the prevalence of detected hidden text salting tricks in subsets of the examples data set, by message category and trick type. 'Glyph size' consolidates 'font size' and 'shape size' (cf. infra). The numbers represent messages per thousand. Aggregates are provided over message categories ('AN' messages) and trick types ('Any' tricks, with and without 'glyph order'). Three salting degrees g are differentiated.
DETAILED DESCRIPTION OF PRESENT INVENTION
[0074] Described is a technology by which hidden text salting (i.e., distorted or hidden textual content patterns) is detected and resolved in digital text sources by analysing the reproduction (graphical rendering) of the text source and, subsequently, by cognitively processing the retained visible content to derive the perceived text. The rendering analysis applies a set of general visibility conditions to the rendered, attributed text primitives (glyphs) to determine their visibility. Next, retaining only the visible glyphs, an artificially intelligent cognitive model of human text reading is used to compose the glyphs into logical text fragments (i.e., words in sentences in text blocks). The cognitive model which we propose searches for an optimal partitioning of the two-dimensional glyph space, in which logical text blocks are identified and assigned the most likely text reading direction. As a result, the invention enables the detection of general types of hidden salting tricks (e.g., as useful indicators for filtering) and makes available the actually perceived text contents, which can be used for further processing.
[0075] Figure 1 gives a schematic outline of the method of the invention.
In short, on modern computing systems, the contents of a digital text source (in Figure 1, the 'message source'; e.g., a Web page) are visualized to an end user by means of a text production process. This process (e.g., a Web browser) creates a parsed, internal representation of the text source and drives the rendering of that representation onto some output medium (in Figure 1, the 'drawing canvas'; e.g., a browser window). The essential idea of the invention is to tap into the rendering process and analyse the rendering commands and attributes to detect anomalies (in particular, hidden content) as manifestations of salting tricks. Next, the intercepted, visible text characters (in Figure 1, 'attributed glyphs') are fed into a cognitive model for the reproduction of the perceived text.
[0076] In the following, the rendering analysis and the workings of the cognitive model are described in detail. Lastly, to illustrate the practical applicability and usefulness of the invention, some of its potential uses are summarized.

1. Rendering Analysis

[0077] The invention accepts as input any process which makes use of a rendering system (described further) to produce or reproduce a visual representation of the digital text source. This text production process could be anything from a simple text viewing application to a full-featured Web browser, graphical e-mail client, GUI (Graphical User Interface) widget or window, etc. The process steers and controls the rendering through commands and directives that are understood by the rendering system.
[0078] A conventional rendering system is assumed, defining the operations for drawing primitive shapes such as lines, rectangles, polygons, images, etc., and text character sequences on an output medium. Whenever applicable, command arguments may control position, size, and other primitive-specific properties. A second set of commands sets or changes common rendering attributes (i.e., attributes used in many of the former drawing operations), including pen colour, pen thickness, pen stroke, background colour, typeface, letter size, etc. [0079] In one implementation of the invention, an open-source rendering software library is used, providing software routines that carry out the primitive drawing operations. The implementation hooks into those software routines, extending their code to monitor and intercept the various drawing primitives. Additional details are set forth below. [0080] The invention intercepts at the level of the rendering system all requests for drawing any of the text primitives. All text drawing operations require an argument that specifies the literal text to be drawn. Most text production processes use these operations for drawing text, since the operations are convenient and most efficient. In addition, any primitives which might conceal such rendered text (i.e., opaque, overlapping shapes) are also intercepted. This makes it possible to incrementally build an internal representation of the characters that appear on screen for the user to read. In particular, the representation that is used is a list of attributed glyphs: positioned shapes of individual characters that are decorated with rendering attributes and any concealing shapes. The glyphs are listed in the compositional order; the order in which they are rendered.
[0081] Inspection of the attributed glyphs reveals which glyphs are sufficiently visible to the human eye. Glyph visibility can be defined and verified by the joint satisfaction of several glyph visibility conditions. For simplicity reasons, the visibility of every glyph is fixed at this stage as a binary attribute, i.e., a glyph is determined as either visible or invisible. The binarization is done by thresholding the visibility measures that are implemented by the conditions. Threshold values can be obtained by empirical experimentation. When desired, different thresholds may apply, depending on the user (e.g., clear-sighted versus dim-sighted people) and/or target device (e.g., rich-colour monitor screens versus mono-colour, high-resolution printers).
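As an illustration of thresholding a visibility measure into a binary attribute, the 'substantial clipping' test from the definitions can be sketched as a simple ratio check. The threshold of 0.5 is an illustrative default, not a value from the description (which obtains thresholds empirically):

```python
def substantially_clipped(visible_area, total_area, threshold=0.5):
    """Binary visibility decision for the clipping condition (cf. [0057]):
    a glyph counts as substantially clipped when its clipped-off area
    exceeds a threshold fraction of its total rendered area. The 0.5
    default is an assumed, illustrative value."""
    clipped_off = total_area - visible_area
    return clipped_off > threshold * total_area

# A glyph with only 10% of its area inside the clip region is invisible;
# one with 80% visible passes the condition.
print(substantially_clipped(10, 100))  # True
print(substantially_clipped(80, 100))  # False
```

Different user or device profiles would simply substitute different threshold values, as the paragraph above suggests.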
[0082] Below, we set forth a basic set of glyph visibility conditions.
Conditions are named for future reference and indicate the applicable glyph type. The set is complete under the assumption of static content (i.e., content whose appearance does not change over time) and assuming simplified rendering conditions. In particular, for 'font colour' it is assumed that glyphs have a uniform, solid fill colour and a uniform background colour (i.e., not considering special compositing modes such as transparency). For instance, an implementation can sample the background colour at the glyph's center. For 'concealment', it is assumed that overlapping shapes are opaque. All these assumptions can be relaxed or removed by more complex (and computationally expensive) variants of these visibility conditions. Conversely, knowledge of the text production process may justify the use of simplifying assumptions.
[0083] As one glyph visibility condition, font size states that the glyph's font size (a rendering attribute) is sufficiently large. The condition applies to all glyphs. [0084] As one glyph visibility condition, shape size states that the glyph's shape (visual outline) is sufficiently large. The condition applies to all non-whitespace glyphs; glyphs necessitating actual drawing to produce their visual representation.
[0085] As one glyph visibility condition, font colour states that the glyph's fill colour contrasts well with the background colour. The condition applies to all non-whitespace glyphs.
[0086] As one glyph visibility condition, concealment states that the glyph is not substantially concealed by other glyphs or overlapping shapes. The condition applies to all glyphs. [0087] As one glyph visibility condition, clipping states that the glyph is drawn mostly inside the drawing clip; a spatial mask that is applied during rendering, and that remains within the target device's physical bounds. The condition applies to all glyphs. [0088] Failure to comply with any of the glyph visibility conditions results in an invisible glyph, which provides an indication of the presence of hidden salting tricks. For example, characters drawn in a zero-sized font will violate the 'font size' and 'shape size' conditions. The invisible ink trick is detected by the 'font colour' condition, and so on. Hence, the invention offers as one result the detection of general types of hidden content (salting) tricks, which manifest themselves in all affected glyphs. This result enables, for instance, the production of salting trick statistics that run over glyphs, possibly split out for the different trick types (i.e., visibility conditions).
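Taken together, the conditions of paragraphs [0083]–[0087] can be sketched as a conjunction of per-glyph checks. All threshold values and the crude contrast metric below are illustrative assumptions; the description leaves them to empirical tuning:

```python
# Hedged sketch of the glyph visibility conditions [0083]-[0087].
# Thresholds and the contrast metric are assumed, illustrative values.
MIN_FONT_PT = 4            # 'font size'
MIN_SHAPE_PX = 2           # 'shape size'
MIN_CONTRAST = 32          # 'font colour' (max channel difference, 0-255)
MAX_HIDDEN_FRACTION = 0.5  # 'concealment' and 'clipping'

def contrast(fg, bg):
    """Crude colour contrast: largest per-channel RGB difference."""
    return max(abs(a - b) for a, b in zip(fg, bg))

def is_visible(glyph):
    """A glyph is visible only if every applicable condition holds."""
    if glyph["font_pt"] < MIN_FONT_PT:                      # 'font size'
        return False
    if not glyph["whitespace"]:
        if glyph["shape_px"] < MIN_SHAPE_PX:                # 'shape size'
            return False
        if contrast(glyph["fill"], glyph["background"]) < MIN_CONTRAST:
            return False                                    # 'font colour'
    if glyph["concealed_fraction"] > MAX_HIDDEN_FRACTION:   # 'concealment'
        return False
    if glyph["clipped_fraction"] > MAX_HIDDEN_FRACTION:     # 'clipping'
        return False
    return True

# The 'invisible ink' trick: white text on a white background.
white_on_white = {"font_pt": 12, "whitespace": False, "shape_px": 8,
                  "fill": (255, 255, 255), "background": (255, 255, 255),
                  "concealed_fraction": 0.0, "clipped_fraction": 0.0}
print(is_visible(white_on_white))  # False
```

Each failed condition also identifies the trick type, which supports the per-trick statistics mentioned above.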
[0089] As a remark to the rendering analysis part of the invention, it is noted that many text viewing applications (e.g., Web browsers, e-mail clients) and rendering systems are currently in existence. Despite standardizations in content markup, formatting and rendering, their subtle differences may be exploited by adversaries when countering an implementation of the invention that is pinned towards a single text production/rendering process. To avoid this, one could perform the analysis in parallel using several (popular) visualisation systems. The results can then be compared and combined.

2. Cognitive Processing
[0090] The deceptive effects of all hidden text salting tricks, detected through the glyph visibility conditions, can be undone (resolved) by a two-step procedure. First, all invisible glyphs are eliminated from further consideration. Second, the reading order of the retained, visible glyphs is determined by a cognitive model. The cognitive model is an essential component, as no relation between the compositional order and the reading order can be reliably assumed (adversaries are known to exploit the naive assumption, e.g., the slice-and-dice trick [6], cf. Figure 2). Many alternative implementations of the cognitive model seem possible. Below, we elaborate one particular implementation.
[0091] Once determined, the reading order can be compared to the compositional order for the detection of order-related hidden salting tricks (alternatively, one could compare the corresponding texts using some distance metric, e.g., the edit distance). Specifically, an instance of a glyph order trick can be defined as a visible glyph whose successor reading glyph differs from its successor compositional glyph. Since reading glyphs are always visible, it should be noted that the detection subsumes all cases in which the successor compositional glyph is invisible. Resolution of this trick class is achieved simply by assuming the glyphs' reading order for further processing of the digital text source.
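The glyph order trick definition above (a visible glyph whose successor in reading order differs from its successor in compositional order) can be sketched directly. Glyphs are represented here by unique identifiers, an assumed simplification:

```python
def order_trick_glyphs(reading, compositional):
    """Detect glyph order trick instances (cf. [0091]): visible glyphs
    whose successor in reading order differs from their successor in
    compositional order. Both arguments are sequences of unique glyph
    identifiers; `reading` contains only visible glyphs."""
    comp_next = {g: n for g, n in zip(compositional, compositional[1:])}
    return [g for g, read_next in zip(reading, reading[1:])
            if comp_next.get(g) != read_next]

# A slice-and-dice style reordering: glyphs rendered as 0, 2, 1, 3 but
# read as 0, 1, 2, 3 yields three trick instances.
print(order_trick_glyphs([0, 1, 2, 3], [0, 2, 1, 3]))  # [0, 1, 2]
print(order_trick_glyphs([0, 1, 2], [0, 1, 2]))        # []
```

Since `comp_next.get` returns `None` for a glyph whose compositional successor was invisible (and hence absent), such cases are also flagged, matching the subsumption noted above.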
[0092] We turn to a particular implementation of the cognitive model. It exploits the conventional structuring of natural language written text in a limited set of text blocks. Text blocks correspond to disjoint, spatially bounded areas on the page, and their textual contents (i.e., contained glyphs) are assumed to obey a uniform reading direction. We note that this assumption may not hold in some languages. For example, Arabic texts are conventionally read from right to left, but may contain Western numbers or names printed left to right. This calls for a more complex text reading model, in which sporadic deviations from a text block's primary reading direction are possible. Following the simplified implementation, the objective is to find a proper division (partitioning) of the page into disparate text blocks (partitions). Hence, we reformulate the cognitive model's task as finding a coherent partitioning of the page, with proper reading directions assigned to the individual partitions. An example result is visualised in Figure 3.
[0093] The partitioning can be obtained using bottom-up (agglomerative) or top-down (divisive) clustering algorithms, which aim to hierarchically group glyphs together into coherent text blocks. Given the observation that for most pages, the perceived text can already be represented by a small number of text blocks with possibly different reading directions, we opt for a top-down, divisive approach. Given the sheer number of alternative partitionings, the limited number of sensible (desired) partitionings, and the constrained performance requirements of most practical applications, we propose a greedy, limited-width, breadth-first search strategy over the search space of all possible partitionings of the page. See also Figure 4. The breadth-first behaviour is motivated by the expectation that (sub)optimal partitionings are already to be found at shallow depths, following the aforementioned observation.
[0094] The initial state of the search algorithm considers the initial partitioning as sole candidate partitioning. The initial partitioning defines a single partition that covers the entire page. The reading direction of this partition is defined by a labelling function F, which assigns the most likely reading direction out of a set of candidate reading directions to its argument partition. F is described further.
[0095] Moving from one depth in the search tree to the next, the set of candidate partitionings is updated. In particular, refined partitionings for every candidate partitioning are proposed (generated). A refinement of a partitioning can be generally defined as dividing (cutting) one of its partitions into multiple subpartitions. From the infinite number of possible refinements (spatial divisions of a partition's region), the implementation selects a finite, limited number of candidates by 1) enforcing two constraints on viable cuts, 2) defining an equivalence relation on cuts, and 3) sampling with a custom sample size. The first constraint forbids cuts from intersecting glyphs (discarding glyphs with no visual shape, such as those representing spaces). Cuts satisfying this constraint we refer to as free cuts. This is a reasonable constraint, since glyphs cannot be shared among partitions (text blocks). The second constraint restricts all cuts to be either horizontal or vertical, and spanning the entire partition region. Cuts satisfying both constraints we refer to as free spanning cuts. The second constraint is motivated by the rectangularity assumption of partitions (oriented along the page bounds) and by computational efficiency. Namely, the coordinate ranges of all horizontally and vertically free spanning cuts are defined by the gaps in the corresponding glyph shape projection profile. The profile is formed by taking the union of the vertical, respectively horizontal, sides (as line intervals) of the rectangles circumscribing the glyph shapes. This operation runs linearly in the order of the number of glyphs. Finally, since only the distribution of glyphs among subpartitions matters, we consider as equivalent all free spanning cuts that generate the same bi-partitioning of glyphs. The equivalence classes correspond to the gaps in the projection profiles. We arbitrarily choose as representative the one free spanning cut that intersects the middle of the gap.
Assuming a finite number of visible glyphs (limited by the page dimensions and the glyph 'shape size' visibility condition), it is clear that the set of representative, free spanning cuts is finite and tractable. However, the arbitrary combination of several cuts in the generation of possible refinements may still be intractable (yet finite). For this reason, the implementation uses a limited and fixed set of partitioning patterns. A partitioning pattern is a template for refinement. It defines the number of horizontal and vertical cuts to use in the refinement, which are kept small. A standard set of partitioning patterns is depicted in Figure 6. [0096] In the implementation, a custom, limited number of refinements are generated iteratively by using a sampling (stochastic) method. The particular sampling method, broken down into different steps, first selects a partition for refinement (inversely proportional to its confidence value, so as to focus on lesser confident partitions). Second, a single partitioning pattern is selected (at random). Third, the required free spanning cuts are sampled (where gap width or other layout properties influence sampling probabilities). The instantiated pattern produces one refinement. Duplicate refinements are ignored. Alternatively, deterministic selection and generation strategies may be employed. They enable reproducibility of results, at the cost of confined search space exploration.
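The projection-profile construction of representative free spanning cuts in [0095] can be sketched as an interval merge followed by taking gap midpoints. Bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples:

```python
def free_spanning_cuts(boxes, axis=0):
    """Representative free spanning cuts for one partition (cf. [0095]):
    project the glyph bounding boxes (x0, y0, x1, y1) onto an axis
    (axis=0 for vertical cuts, axis=1 for horizontal cuts), merge the
    overlapping intervals into the projection profile, and return the
    midpoint of each interior gap. Requires at least one box."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:              # overlaps the previous run
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # one representative cut per gap, placed at the middle of the gap
    return [(a[1] + b[0]) / 2 for a, b in zip(merged, merged[1:])]

# Three glyph boxes: two overlapping runs leave one gap from x=12 to x=20,
# giving a single representative vertical cut at x=16.
boxes = [(0, 0, 10, 10), (5, 0, 12, 10), (20, 0, 30, 10)]
print(free_spanning_cuts(boxes))  # [16.0]
```

After sorting, the merge pass runs linearly in the number of glyphs, consistent with the complexity claim above.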
[0097] Returning to the description of the search algorithm, better refined (offspring) partitionings replace their parent candidate partitioning. The notion of 'better' could be expressed in a comparator function that considers some criteria. In the implementation, we define an objective function O that computes a real-valued quality measure of its argument partitioning. In particular, O computes a weighted average of the partitions' confidence scores. A single partition's confidence score C is a real-valued measure of the joint belief in the logical relatedness of the glyphs contained in that partition and the assignment of its reading direction predicted by F as being the 'proper' reading direction. C is computed from a set of layout and linguistic features. The current feature set is described further. O weighs the confidence scores by the size of the corresponding partitions, defined as the number of contained glyphs. Without such a weighting scheme, the algorithm was found to exhibit a tendency to cut off small, high-confidence partitions (e.g., containing lots of dictionary words) from large, less confident partitions (e.g., containing proportionally more rare words).
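The effect of the size weighting in O can be demonstrated with a small numeric example. Representing a partitioning as a list of `(glyph_count, confidence)` pairs is an assumed simplification:

```python
def weighted_objective(blocks):
    """O from [0097]: partition confidence scores weighted by glyph count.
    `blocks` is a list of (glyph_count, confidence) pairs."""
    total = sum(size for size, _ in blocks)
    return sum(size * conf for size, conf in blocks) / total

def unweighted_objective(blocks):
    """Naive average, shown only to illustrate why weighting is needed."""
    return sum(conf for _, conf in blocks) / len(blocks)

# A 1000-glyph page versus the same page with a tiny, high-confidence
# block carved off (confidence values are illustrative).
whole = [(1000, 0.60)]
cutoff = [(10, 0.99), (990, 0.55)]

print(unweighted_objective(cutoff) > unweighted_objective(whole))  # True
print(weighted_objective(cutoff) > weighted_objective(whole))      # False
```

The naive average rewards carving off the tiny dictionary-rich block, while the size-weighted O correctly prefers keeping the page whole, matching the behaviour reported above.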
[0098] If a candidate partitioning fails to generate better offspring, it remains in the set of candidate partitionings. Doing so preserves it as a viable solution to the page partitioning problem. Optionally, before proceeding at the next depth of the search, the set of candidate partitionings can be reduced to narrow down the search to, for instance, the best few candidates only, as indicated by O (cf. greedy, limited- width search strategy). This reduction could be motivated in practical settings to avoid a combinatorial explosion of candidates.
[0099] The search algorithm is terminated at the start of a new depth when one of several stopping criteria is fulfilled. For example, stopping criteria might limit the search depth, the total search duration, or may specify sufficient levels of optimality. At termination, O pinpoints the optimal candidate partitioning Po within the set of candidate partitionings as a local or global optimum over the search space. Finally, the glyphs' reading order readily follows from the reading direction labellings of the partitions in Po.
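The overall search of paragraphs [0093]–[0099] can be sketched as a beam-limited, breadth-first loop with pluggable stopping criteria. The depth, time, and beam-width parameters are illustrative defaults, not values from the description:

```python
import time

def search_partitionings(initial, refine, objective, max_depth=5,
                         max_seconds=1.0, beam_width=2):
    """Skeleton of the greedy, limited-width, breadth-first search over
    partitionings ([0093]-[0099]). `refine` proposes offspring for a
    candidate; better offspring replace their parent, otherwise the
    parent is retained; the beam is pruned by the objective; the loop
    stops on depth or time limits. All defaults are assumed values."""
    candidates = [initial]
    deadline = time.monotonic() + max_seconds
    for _ in range(max_depth):                 # stopping criterion: depth
        if time.monotonic() > deadline:        # stopping criterion: time
            break
        nxt = []
        for cand in candidates:
            offspring = [o for o in refine(cand)
                         if objective(o) > objective(cand)]
            nxt.extend(offspring or [cand])    # keep parent if no better child
        # greedy, limited-width: retain only the best few candidates
        candidates = sorted(nxt, key=objective, reverse=True)[:beam_width]
    return max(candidates, key=objective)

# Toy usage: states are integers, refinement increments up to 3, and the
# objective is the state value itself; the search climbs to the optimum.
best = search_partitionings(0, lambda n: [n + 1] if n < 3 else [],
                            lambda n: n)
print(best)  # 3
```

In a full implementation, `refine` would instantiate the partitioning patterns via the sampled free spanning cuts, and `objective` would be the size-weighted O.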
[00100] In the following, we turn back to one particular implementation of the labelling function F, which assigns the most likely reading direction to its argument partition, along with a confidence score C. An example set of candidate reading directions is depicted in Figure 5.
[00101] In the implementation, F is a stepwise filter. Starting with the full set of candidate reading directions, candidates can be excluded in subsequent steps as more evidence is computed (cf. cascaded classification). As evidence, scores are computed for various layout and linguistic features, further detailed below. Every filter step computes the scores of all remaining candidates for a particular feature. Those candidates whose feature score is below some threshold from the maximum feature score are excluded. The use of thresholds relative to the maximum score ensures that at least one candidate survives our stepwise filter. Threshold values can be obtained by optimizing reading direction recognition accuracy on a training set. In the absence of further evidence, the final filter step makes a choice between the surviving reading direction candidates, either randomly or optimally. Opting for a random choice instead of taking the best candidate is expected to broaden the exploration of the search space, which may otherwise be hindered by a purely deterministic, greedy strategy. [00102] From all the feature scores of the chosen reading direction, the confidence score C is computed. In the implementation, C is defined as the product of those scores, with two modifications. First, a penalty factor lowers the confidence whenever a substantial visual gap crosses all text lines oriented according to the chosen reading direction. These gaps indicate discontinuities in reading the partition's textual contents. The effect and purpose of the total penalty is to promote the partition being reconsidered for refinement (cf. infra, refinement sampling method). Second, it is more beneficial to use scaled rather than the raw feature scores. Rescaling provides the flexibility to more evenly spread the feature scores over their domain, with a clearer separation of scores relating to the proper reading direction. 
Also, the rescaling ensures safe computations on computer hardware using limited-precision floating point number representations, since the original scores might be close to 0, for example. [00103] In order to illustrate some of the discriminative reading direction features that can be used in F, we continue by describing the filter steps of the example implementation.
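Before turning to the individual features, the stepwise filtering described above can be illustrated with a rough sketch. The candidate set, function names, and threshold handling below are illustrative assumptions, not the patented implementation; each feature maps a partition and candidate to a score in [0, 1], and surviving candidates must score within a margin of the per-step maximum.

```python
import random

def stepwise_filter(partition, features, thresholds, pick_random=False):
    """Assign a reading direction and a confidence score to a partition.

    features   -- list of functions mapping (partition, candidate) to [0, 1]
    thresholds -- per-feature margins, measured relative to the maximum score
    """
    # Example candidate set (cf. Figure 5); purely illustrative.
    candidates = ["TD-LR", "TD-RL", "BU-LR", "BU-RL"]
    history = {c: [] for c in candidates}

    for feature, margin in zip(features, thresholds):
        scores = {c: feature(partition, c) for c in candidates}
        best = max(scores.values())
        # Relative thresholding: the maximum itself always survives,
        # so the candidate set can never become empty.
        candidates = [c for c in candidates if scores[c] >= best - margin]
        for c in candidates:
            history[c].append(scores[c])

    # Final step: choose randomly (broader search-space exploration) or
    # greedily by the last computed feature score.
    chosen = (random.choice(candidates) if pick_random
              else max(candidates, key=lambda c: history[c][-1]))
    confidence = 1.0
    for score in history[chosen]:
        confidence *= score  # raw product of feature scores
    return chosen, confidence
```

The gap penalty factor and the score rescaling of paragraph [00102] would be applied on top of this raw product.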
[00104] In the first filter step, the layout feature exploits the property that whenever text is rendered in a variable-width font (i.e., a font where the advance or pixel width of a glyph depends on the character it represents), the same ordinal character position on different text lines is very likely to correspond to quite different pixel offsets, measured from the start of the line. Fixed-width fonts, however, result in glyphs being arranged in a regular grid of rows and columns. In either case, glyph widths do not affect the spacing between text lines. Thus, measuring the alignment of glyphs both horizontally and vertically may provide a decisive cue in detecting the text lines' orientation. Instead of remaining agnostic in the case of fixed-width fonts, one may also consider the number of glyphs on each text line. For Western languages, it is very unlikely that multiple text lines all contain only a few glyphs; such an observation may instead indicate a false reading direction, for example when a title or isolated paragraph is read vertically instead of horizontally.
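A minimal sketch of how glyphs might be grouped into candidate text lines when measuring such alignment follows; the association rule mirrors the one detailed in the next paragraph, and reducing glyphs to (baseline, extent) pairs is an illustrative simplification.

```python
def cluster_text_lines(glyphs):
    """Single-pass grouping: a glyph g joins line L iff |b(g) - b(L)| < e(g),
    where b(L) is the mean baseline of the glyphs already on L.

    glyphs -- iterable of (baseline, extent) pairs in compositional order.
    """
    lines = []  # each line is a list of (baseline, extent) pairs
    for b, e in glyphs:
        for line in lines:
            b_line = sum(g[0] for g in line) / len(line)
            if abs(b - b_line) < e:  # bounded tolerance for super/subscripts
                line.append((b, e))
                break
        else:
            lines.append([(b, e)])  # no existing line fits: start a new one
    return lines
```

Averaging the baseline deviations within each resulting line, and penalizing lines with few glyphs, would then yield the layout feature score.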
[00105] For every candidate reading direction, the layout feature uses a single-pass clustering algorithm that groups the glyphs of the partition into text lines, oriented according to the considered reading direction. When clustering, a new text line is formed as soon as a glyph cannot be associated (position-wise) with any of the existing text lines. More precisely, we define the extent e(g) and baseline b(g) of a glyph g as the pixel height and the lower vertical pixel coordinate, respectively, in the case of horizontally oriented text lines, and as the pixel width and the lower horizontal pixel coordinate, respectively, in the case of vertically oriented text lines. Then, the glyph gi is associated with a text line L = {gj}j<i with baseline b(L) = avgj b(gj) if and only if |b(gi) - b(L)| < e(gi). Since this rule expresses bounded tolerance towards elevated or lowered glyphs (as with super- and subscripts), we define as a measure of alignment the deviation of glyphs' baselines from their associated text line's baseline, averaged over text lines (after clustering). In addition, we apply a penalty factor for text lines containing fewer glyphs than some threshold. The resulting feature score lies within [0, 1]. Figure 8(a) confirms that the measure is well capable of reliably detecting the orientation of text lines. Used in the first filter step, this feature may already eliminate half of the candidate reading directions. [00106] As described above, the organisation of glyphs into text lines for the different candidate reading directions allows an early reconstruction of candidate perceived texts T restricted to the partition at hand. These texts are utilized in the following filter steps (feature implementations). [00107] In the second filter step, the word lengths feature exploits the property that reconstructing text along different text line orientations leads to significant statistical differences in the distribution of word lengths.
We define a word as a sequence of glyphs on a single text line, delimited by whitespace: either whitespace glyphs or a visible gap on the page. When encoding T as a sequence of word lengths W = <w0, . . . , wz>, the likelihood of W can be measured using a probabilistic model of normal word lengths, e.g., one derived from a reference corpus C. In order to reduce the computational complexity, the following sampling strategy can be used. One randomly samples n subsequences W[li . . . li + m - 1] of m words from W, and computes the joint probability of all successive k-grams therein, assuming independence for simplicity. A gram is defined here as the length of a single word wi. The probability of a k-gram is defined as its relative frequency in C, mixed with a uniform background model to avoid zero probabilities (unobserved, yet possible k-grams). Averaging over the n subsequences and multiplying over the contained k-grams then produces the word lengths feature score, within [0, 1]. [00108] Since both are orientation-aware features, the mutual information between the word lengths feature and the layout feature is maximal (close to 1) when variable-width fonts are used. In cases where fixed-width fonts neutralize layout analysis, word lengths may provide additional evidence. However, their discriminative power is lower, as can be observed from Figure 8(c). [00109] So far, only the orientation of text lines could be discerned. In the third filter step, the characters feature aims to identify the direction in which text lines are to be read. For this, we apply the same technique of the previous step to character sequences, equally termed k-grams. A gram is redefined as a single character ci taken from the character sequence C = <c0, . . . , cz> which represents T. Analogous to the previous step, we define the characters feature score within [0, 1].
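The k-gram scoring shared by the word lengths and characters features can be sketched as follows, here for word lengths. The corpus statistics, mixing weight, vocabulary size, and sampling parameters are illustrative assumptions.

```python
import random
from collections import Counter

def word_length_score(lengths, corpus_lengths, k=2, n=5, m=4,
                      lam=0.9, vocab=30):
    """Score a word-length sequence against k-gram statistics of a reference
    corpus, mixing with a uniform background model to avoid zero
    probabilities, and averaging over n randomly sampled windows of m words.
    """
    counts = Counter(tuple(corpus_lengths[i:i + k])
                     for i in range(len(corpus_lengths) - k + 1))
    total = sum(counts.values())
    uniform = 1.0 / vocab ** k  # uniform background probability per k-gram

    def window_prob(window):
        p = 1.0
        for i in range(len(window) - k + 1):
            gram = tuple(window[i:i + k])
            # Relative corpus frequency mixed with the uniform background.
            p *= lam * counts[gram] / total + (1 - lam) * uniform
        return p

    starts = [random.randrange(len(lengths) - m + 1) for _ in range(n)]
    return sum(window_prob(lengths[s:s + m]) for s in starts) / n
```

Replacing word lengths with single characters as the grams yields the analogous characters feature score.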
From Figure 8(d), it is clear that the characters feature is capable of reliably identifying the true text line orientation and reading direction. [00110] In the fourth filter step, the common words feature performs a dictionary lookup of all the words in T. The dictionary D can be a listing of the most frequent words in a reference corpus. More precisely, we define the common words feature score as the relative number of words in T that are present in D, within [0, 1]. Of all features considered, this feature was found to have the greatest discriminative power among all text line orientations and reading directions, as can be seen from Figure 8(b). However, since a lookup operation is performed for every word in T, this feature is computationally more expensive, and is therefore considered at the filter's end.
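As a sketch, the common words feature reduces to a simple dictionary-hit ratio (the function name and case handling are assumptions):

```python
def common_words_score(words, dictionary):
    """Fraction of words in the reconstructed perceived text that occur in a
    dictionary of frequent words; score lies within [0, 1]."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in dictionary)
    return hits / len(words)
```

Because every word triggers a lookup, this score costs O(|T|) dictionary probes, which is why the filter defers it until only a few candidates remain.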
[00111] The features used in the determination of the reading order are generic and - although illustrated for the English language in Figures 8 and 9 - can be used for detecting the reading order of any language, including segmented languages (where word tokens are delimited by white space) and unsegmented languages (e.g., Asian languages such as Chinese). In the latter case, when using the second filter step, word lengths are recognized by using a dictionary of words or a language model in the considered language (manually or automatically acquired), and in case certain glyph sequences are not found in the dictionary, a default, close-to-zero probability can be used for each glyph that is not part of a recognized word. In case of ambiguity, i.e., when different configurations of words can be recognized, the word length feature score can be averaged over the different possible readings. Alternatively, when considering a language, new features for determining the reading order can be added, or some of the features described above can be deleted.

3. Uses of the Invention
[00112] One obvious use of the invention is the filtering of digital text sources, based on the detection of hidden salting tricks and/or the covertext. As potential content filtering applications, we mention the detection of Web spam [7], Web page cloaking [19], spoofing on the Internet [3], masqueraded data transfers on peer-to-peer networks, copyright infringements, unsolicited popups, spam and advertisements [8, 12], electronic greeting cards, IM (Instant Messaging) and MMS (Multimedia Messaging Service) communications on wired and mobile networks, offensive contents (e.g., pornography, scams, ideological rhetoric), malware spread through technical subterfuge, and many more [14].
[00113] One other use of the invention is to make content available in electronic form to allow for preservation, widespread availability and use, ease of reproduction, and to facilitate retrieval, search, mining, etc. Currently, a lot of content is available only in raw, physical form, often on handwritten or printed paper (e.g., historical texts, manuscripts, library reference cards, forms, checks, postal mail pieces). Transforming this content into digital form requires sophisticated scanning devices and automatic text reading systems, which are able to cast the scanned input images into sensible spatial compositions, and to interpret identified textual or symbolic content under proper reading directions. The latter, complicated problem could be addressed by the cognitive model of the invention.
[00114] One other use of the invention lies in leveraging digital content accessibility tools for visually impaired people [13]. Typically, these tools suffer from heterogeneous data sources that lack, or use inconsistently, the accessibility directives or descriptors which mark up the content. Especially in these situations, the cognitive model could assist (either proactively or on the user's request) by pointing out the layout of the page, and by revealing the perceivable, textual contents of any of the identified text regions. This approach enables a guaranteed basic interpretation of, and access to, any content, regardless of medium, type, or accessibility annotation level.
[00115] The invention can be used in association with OCR in multiple ways. One use addresses the adversarial technique of embedding sensitive textual content in graphic images, which are drawn pixel-by-pixel on screen using image drawing operations (cf. image spam [20, 9]). Direct interception of the textual content from the image is not possible; hence, the plaintext generated by the invention will be incomplete, since it represents only part of the perceived text. In recent work, Fumera et al. [4] use OCR (Optical Character Recognition) tools for the extraction of textual content from images attached to e-mails. A similar approach can be followed to extend our invention. In particular, the images can be intercepted from the image drawing operations, then preprocessed and scanned by means of OCR, and any recognized, positioned characters can be added to the list of attributed glyphs. The unaltered cognitive model will then be able to pick up these characters, and restore their reading order amidst and in relation to otherwise intercepted characters. [00116] Another use of OCR regards the detection of new hidden text salting tricks, undetectable by the current invention. The gist of the technique would be to compare the perceived text derived by the invention (cognitive model) with an OCR-extracted text from an image picturing a visual reproduction of the digital text source. Marked differences in textual content could signal new hidden text salting tricks (if not errors of either method), and indicate vulnerabilities of the current invention (cognitive model). Manual analysis can be used to reveal the new trick, and the invention can be extended by incorporating it, e.g., as a new visibility condition. [00117] The use of the invention is not restricted to static content (i.e., content whose appearance does not change over time). The invention supports dynamic content by incorporating a time dimension.
In particular, every glyph can be annotated with its display time (start and end time of appearance), and all its attributes (including colour, size, position) can be tracked over time. This requires continuous monitoring of a temporal rendering process. The continuous updating of the internal representation of attributed glyphs triggers a recurring application of the cognitive model to produce updated versions of the perceived text (as a function of time). Detection rules that signal dynamic content tricks can also be defined, for instance, rules identifying text that cannot be perceived because its display time is too short to be observed by humans.

Examples
[00118] The figures, statistics and findings presented in this section serve to demonstrate the implementability, potential use and prospects of the invention. As they relate to one particular implementation of the invention and apply to a particular test data set, they are illustrative only. No guarantees, rights or fitness for any particular purpose can be derived from them.
[00119] For the examples below, a data set is used containing 252,515 e-mail messages from the time period 2000 - 2006, manually labelled into the categories spam (unsolicited mail), phishing (legitimate-looking, yet fraudulent mail [3]), and ham (legitimate mail). Orthogonally, a second classification is provided, differentiating e-mails that contain HTML content from those not containing any HTML (non-HTML). Because HTML is a richer text format, we expect to find more hidden salting in HTML e-mail. The basic statistics are given in Figure 7. The data set comprises the 92,189 messages from the 2001-2002 TREC public e-mail corpus [18], augmented with various private e-mail feeds. The private data creates a larger, more recent and more representative data set, also containing labelled phishing data; a likely source of hidden salting. Through sampling, it is confirmed that the messages' content language is English, and their reading direction is TD-LR across the entire page. [00120] Figure 8 shows the distribution of feature scores that are used in the classification of a text block's reading direction. The scores are computed for the entire pages of the e-mail data set. The resulting histograms illustrate the potential discriminative power of the different features. Figure 8(a) shows that the distribution of the 'layout' feature scores when measured in the wrong text line orientation (i.e., vertical instead of horizontal) is markedly different from that in the correct horizontal orientation. From Figure 8(b), it is clear that the occurrence of 'common words' is largest when measured in the correct reading direction (i.e., TD-LR). Figure 8(c) shows a significant shift in the distribution of the 'word lengths' feature when measured on horizontal rather than vertical text lines. Lastly, Figure 8(d) demonstrates that character n-grams (the 'characters' feature) can be used to reliably distinguish the correct reading direction.
Note that the value scale of Figures 8(c) and 8(d) is base-10 logarithmic, as the raw feature scores are close to 0. Figure 9 shows ROC (Receiver Operating Characteristic) curves of the reading direction assignment function F at different steps of F and for the different message categories. The curve labelled 'All' represents a macro-average over message categories. Recall and precision are defined as follows. Recall at step i of F is defined as the relative number of pages (likewise, messages) for which the set of candidate reading directions still contains the correct reading direction (i.e., TD-LR) at the end of step i. Precision at step i is the relative number of correct reading directions in the set of candidate reading directions at the end of step i, averaged over all pages. From Figure 9(a), it is clear that the 'layout' feature may reliably exclude up to half of the candidate reading directions (improving precision from 1/4 to 1/2). There is a correlation between filtering effectiveness at this step and the proportion of HTML messages in every message category. In particular, almost all phishing messages in our data set contain HTML content (cf. Figure 7), which is rendered in variable-width fonts, making the layout feature most effective.
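Under the definitions above, recall and precision at a given filter step can be computed from the per-page sets of surviving candidates; a sketch, assuming a single correct reading direction per page (TD-LR, as in this data set):

```python
def recall_precision_at_step(candidate_sets, correct="TD-LR"):
    """Recall/precision over pages after a filter step.

    candidate_sets -- one set of surviving candidate reading directions
                      per page.
    """
    n = len(candidate_sets)
    # Recall: fraction of pages whose candidate set still holds the truth.
    recall = sum(1 for s in candidate_sets if correct in s) / n
    # Precision per page: fraction of surviving candidates that are correct
    # (1/|s| if the correct direction survived, else 0), averaged over pages.
    precision = sum((1 / len(s)) if correct in s else 0.0
                    for s in candidate_sets) / n
    return recall, precision
```

With the full four-candidate set on every page this yields the 1/4 starting precision quoted above, rising to 1/2 once half of the candidates are excluded.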
[00121] In Figure 9(b), the message categories have different recall offsets (at 80% precision) as a result of the previous filtering steps (including step 1, cf. Figure 8(a)). By lowering the filter threshold in the final step of F (considering the 'common words' feature), the precision of ham and phishing can be increased to 98% and higher without loss of recall. This can be explained by the verbosity of these legitimate(-looking) messages, in which common words abound. Not only is spam less verbose, its words are often missing from our dictionary (e.g., product names and randomized words), so its curve declines earlier.
[00122] Concluding from Figure 9, the final best, macro-averaged assignment accuracy of F is 98% (98.04% precision, 99.61% recall). These results may even be suboptimal due to the local threshold optimization procedure and the somewhat arbitrary choices for the thresholds in early steps of F. Furthermore, an unknown but estimated small percentage of the data set consists of noisy bogus e-mails, which create a small error residual.
[00123] Figures 10 - 12 show how much hidden text salting is detected in the data set for the different message categories, trick types, and for three salting degrees g: the percentage of glyphs in the analysed page that are affected by any of the detected tricks. Because the TREC corpus is used in many spam filtering studies, Figure 13 reports on that corpus only. In the tables, 'Clip' stands for the 'clipping' trick, 'Conceal' for the 'concealment' trick, 'Font colour' for the 'font colour' trick, 'Glyph size' consolidates the 'font size' and 'shape size' tricks (cf. rendering analysis), and 'Glyph order' relates to a difference in reading order from compositional order (cf. cognitive processing). The numbers represent messages per thousand. Aggregates are provided over message categories ('All' messages) and trick types ('Any' tricks, with and without 'glyph order'). Statistics on the 'glyph order' trick are indicative only: its detection precision is expected to be lower due to the incomplete heuristic search, the imperfect or ambiguous page partitioning, and reading direction misassignments. The detection precision of the other trick types may be less than perfect due to rare rendering artefacts (e.g., improper layout or formatting incompatibilities of the text production process) and the somewhat arbitrary thresholding of visibility conditions (cf. rendering analysis).
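The salting degree g used in these tables reduces to a glyph-level percentage; a trivial sketch (the function name is an assumption):

```python
def salting_degree(total_glyphs, affected_glyphs):
    """Salting degree g: the percentage of glyphs on the analysed page that
    are affected by any detected hidden salting trick."""
    if total_glyphs == 0:
        return 0.0
    return 100.0 * affected_glyphs / total_glyphs
```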
[00124] Regarding trick types, the statistics confirm that 'clipping' and 'concealment' are technically impossible to realize in plain text (non-HTML) messages, and hard to achieve in standard HTML (they arise more easily from rendering artefacts). Invisible 'font colour' is the most prevalent hidden salting trick, extensively used in both spam and phishing. Illegible 'font size' or glyph 'shape size' is more common in spam than in phishing (phishers may use zero-sized, invisible fonts, but will refrain from any visible small print). For the 'glyph order' trick, it seems that compositional order is most suspicious in spam. Regarding message categories and disregarding 'glyph order', no substantial hidden text salting (g > 10%) is found in the TREC ham or in the non-HTML ham (the vast majority of ham) of our entire data set. Overall, 1.8 per thousand ham messages are substantially salted, compared to 26.2 for spam (19.9 in TREC) and 30.8 for phishing. The smaller numbers for the older TREC corpus suggest that hidden salting is becoming more prevalent.
[00125] Concluding from Figures 10 - 12, application of the invention indicates that within the data set, at least 3-4% of all spam and phishing messages contain some type of hidden text salting (2-3% is substantially salted), while substantial hidden text salting is hardly found in legitimate messages.
References
[1] A. Bratko, G. Cormack, B. Filipič, T. Lynam, B. Zupan, Spam filtering using statistical data compression models, Journal of Machine Learning Research 7 (2006) 2673-2698.
[2] P. Daniels, W. Bright (eds.), The World's Writing Systems, Oxford University Press, 1996.
[3] T. Dinev, Why spoofing is serious internet fraud, Communications of the ACM 49 (10) (2006) 77-82.
[4] G. Fumera, I. Pillai, F. Roli, Spam filtering based on the analysis of text information embedded into images, Journal of Machine Learning Research 7 (2006) 2699-2720.
[5] W. Gansterer, M. Ilger, P. Lechner, R. Neumayer, J. Strauß, Anti-spam methods: State-of-the-art, Tech. rep., Institute of Distributed and Multimedia Systems, Faculty of Computer Science, University of Vienna, Austria (2005). URL http://www.ifs.tuwien.ac.at/neumayer/pubs/GAN05 report.pdf
[6] J. Graham-Cumming, The spammer's compendium, in: Proceedings of the 2003 Spam Conference, 2003, kept updated at http://www.jgc.org/tsc.html.
[7] Z. Gyöngyi, H. Garcia-Molina, Web spam taxonomy, in: Workshop on Adversarial Information Retrieval on the Web, 2005.
[8] I. Hann, K. Hui, Y. Lai, S. Lee, I. Png, Who gets spammed?, Communications of the ACM 49 (10) (2006) 83-87.
[9] IronPort Systems, Image spam: The email epidemic of 2006, Security trends, IronPort (2006). URL http://www.ironport.com/pdf/ironport image spam datasheet.pdf
[10] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264-323.
[11] K. Krueger, The spam battle 2002: A tactical update, Tech. rep., SANS (2002). URL http://www.sans.org/reading room/whitepapers/email/
[12] S. McCoy, A. Everard, P. Polak, D. Galletta, The effects of online advertising, Communications of the ACM 50 (3) (2007) 84-88.
[13] M. Paciello, Web Accessibility for People with Disabilities, CMP Books, 2000.
[14] B. Panda, J. Giordano, D. Kalil, Next-generation cyber forensics, Communications of the ACM 49 (2) (2006) 44-47, special issue.
[15] V. Prakash, Vipul's razor. URL http://razor.sourceforge.net/
[16] V. Schryver, Distributed checksum clearinghouses. URL http://www.rhyolite.com/anti-spam/dcc/
[17] The Apache SpamAssassin project. URL http://spamassassin.apache.org
[18] TREC corpus download. URL http://plg.uwaterloo.ca/gvcormac/treccorpus/
[19] B. Wu, B. Davison, Detecting semantic cloaking on the web, in: WWW '06: Proceedings of the 15th International Conference on World Wide Web, 2006.
[20] C. Wu, K. Cheng, Q. Zhu, Y. Wu, Using visual features for anti-spam filtering, in: IEEE International Conference on Image Processing, vol. 3, 2005.

Claims

What is claimed is:
1. A method for detecting hidden or distorted content in a digital text source that is not perceived by the human end user when the plaintext of said digital text source is rendered through text production processes on output signals, the method comprising the steps of:
1.1. building an internal representation of the characters that appear on the output medium for the user to read,
1.2. classifying each of the characters in said internal representation as visible or invisible by evaluating one or more visibility conditions on general visual properties of the glyphs,
1.3. indicating that the digital text source contains hidden content when one or more of the characters is classified as invisible.
2. The method of claim 1 further comprising:
2.1. constructing a page containing only characters classified as visible,
2.2. determining the reading order in which the visible characters are most likely read by humans by using a cognitive model,
2.3. comparing the reading order obtained in step 2.2 with the rendering order of the characters,
2.4. indicating that the digital text source contains distorted content when the reading order differs from the rendering order.
3. The method of claim 2 further comprising determining the perceived content of the digital text source by ordering the visible characters according to the reading order.
4. The method of any of the claims 2-3 wherein determining the reading order comprises the steps of:
4.1. creating a set of candidate reading orders,
4.2. creating a set of discriminative features,
4.3. calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of a discriminative feature out of the set of discriminative features,
4.4. comparing the probabilities computed in step 4.3 to remove unlikely reading order candidates from the set of candidate reading orders,
4.5. repeating steps 4.3 and 4.4 for the remaining candidate reading orders, whereby another discriminative feature is considered, until all discriminative features out of the set of discriminative features have been considered,
4.6. selecting the reading order from the remaining set of candidate reading orders by taking all probabilities related to the different discriminative features into account.
5. The method of any of the claims 2-3 wherein determining the reading order comprises the steps of:
5.1. creating a set of candidate reading orders,
5.2. calculating, for each candidate reading order, the probability that said candidate reading order is the correct reading order on the basis of one or more discriminative features,
5.3. selecting the reading order from said set of candidate reading orders by selecting the reading order with the highest probability.
6. The method of any of the claims 2-5 wherein the page of visible characters is first partitioned into different text blocks and the reading order is determined for each text block and wherein the digital text source is indicated as containing distorted content when one of the text blocks has a reading order which differs from the rendering order of the characters within that text block.
7. The method of claim 6 wherein partitioning of a page of visible characters into different text blocks comprises the steps of:
7.1. calculating an objective value of the page;
7.2. creating a set of candidate partitionings of said page,
7.3. calculating, for each candidate partitioning of the page obtained in step 7.2, an objective value,
7.4. selecting, from the set of candidate partitionings obtained in step 7.2 together with the page itself, the partitioning or page with the highest objective value.
8. The method of claim 7 wherein calculating an objective value of a page or partitioning of a page is based on the confidence scores of its member text blocks.
9. The method of any of the claims 7-8 wherein creating a set of candidate partitionings is based on visual properties of the page, including but not limited to text layout and spacing.
10. The method of any of the claims 6-9 wherein the text blocks with associated reading order are composed into an overall perceived text.
11. The method of any of the previous claims wherein said visibility conditions comprise the font size of a glyph, the shape size of a glyph, and the contrast of a glyph's colour with the background colour.
12. The method of any of the previous claims wherein the character is marked as invisible when its glyph is substantially concealed by other shapes, including other glyphs.
13. The method of any of the previous claims wherein the character is marked as invisible when its glyph is substantially clipped.
14. The method of any of the previous claims wherein the character is marked as invisible when the display time of its glyph is too short to be observed or otherwise interpreted within a context by humans.
15. The method of any of the previous claims wherein the digital text source is: an e-mail message (e.g., an EML file), a Web page (e.g., an HTML file), an electronic text document (e.g., a Microsoft Word document, a PDF file, a PostScript file), an electronic presentation (e.g., a Microsoft PowerPoint document), or any other text-oriented data format.
16. The method of one of the previous claims wherein the plaintext contains HTML.
17. The use of the method of any of the previous claims for filtering digital content.
18. The use of the method of any of the claims 1-16 for making content available in electronic form.
19. A computer program product adapted to carry out the method of any of the claims 1-16 when run on a computer.
20. A computer readable medium comprising computer executable program code adapted to carry out the steps of any of the claims 1-16.
PCT/US2008/079671 2007-10-12 2008-10-12 Method for detecting and resolving hidden text salting WO2009049275A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0719964.9 2007-10-12
GBGB0719964.9A GB0719964D0 (en) 2007-10-12 2007-10-12 Method for detecting and resolving hidden text salting

Publications (1)

Publication Number Publication Date
WO2009049275A1 true WO2009049275A1 (en) 2009-04-16



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040221062A1 (en) * 2003-05-02 2004-11-04 Starbuck Bryan T. Message rendering for identification of content features
US20070192863A1 (en) * 2005-07-01 2007-08-16 Harsh Kapoor Systems and methods for processing data flows


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8571270B2 (en) 2010-05-10 2013-10-29 Microsoft Corporation Segmentation of a word bitmap into individual characters or glyphs during an OCR process
US20160110905A1 (en) * 2010-09-01 2016-04-21 Raymond C. Kurzweil Systems and Methods For Rendering Graphical Content and Glyphs
US20150161135A1 (en) * 2012-05-07 2015-06-11 Google Inc. Hidden text detection for search result scoring
US9336279B2 (en) * 2012-05-07 2016-05-10 Google Inc. Hidden text detection for search result scoring
WO2014059934A1 (en) * 2012-10-18 2014-04-24 Tencent Technology (Shenzhen) Company Limited Method and apparatus for detecting hidden content of web page
US9979746B2 (en) 2012-10-18 2018-05-22 Tencent Technology (Shenzhen) Company Limited Method and apparatus for detecting hidden content of web page
US10333972B2 (en) 2012-10-18 2019-06-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for detecting hidden content of web page
US9740667B2 (en) 2015-10-05 2017-08-22 Wipro Limited Method and system for generating portable electronic documents
WO2017117024A1 (en) * 2015-12-31 2017-07-06 Acxiom Corporation Salting text in database tables, text files, and data feeds
CN109416625A (en) * 2015-12-31 2019-03-01 利弗莱姆有限公司 Text is carried out in the feeding of database table, text file and data to add salt

Also Published As

Publication number Publication date
GB0719964D0 (en) 2007-11-21


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 08838405

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 08838405

Country of ref document: EP

Kind code of ref document: A1