US20060190481A1 - Classifier Tuning Based On Data Similarities - Google Patents
- Publication number
- US20060190481A1 (application US11/380,375)
- Authority
- US
- United States
- Prior art keywords
- data item
- classification
- threshold
- classifier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99937—Sorting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99944—Object-oriented database structure
- Y10S707/99945—Object-oriented database structure processing
Definitions
- spam e-mail can be disruptive, annoying, and time consuming.
- spam e-mail represents tangible costs in terms of storage and bandwidth usage, which costs are not negligible due to the large number of spam e-mails being sent.
- a data item classifier is set up by accessing data items of known classification; removing substantially similar items from the data items to identify unique data items; and configuring the data item classifier, based on the unique data items, for future classification of at least one data item of unknown class.
- the configuring results in the data item classifier being capable of determining a measure that a data item of unknown class belongs to a particular class.
- Training the scoring classifier may include analyzing the set of unique training data items to identify n features in the set of unique training data items and forming an n-by-m feature matrix.
- m is equal to the number of unique training data items in the set of unique training data items, such that each row of the n-by-m feature matrix corresponds to one of the unique training data items and the entries in each row indicate which of the n features the corresponding training data item contains.
- Training the scoring classifier also may include reducing the n-by-m feature matrix to a N-by-m reduced feature matrix, where N is less than n; inputting each row of the N-by-m reduced feature matrix, along with the known classification of the training data item corresponding to the input row, into the scoring classifier such that the scoring classifier develops the classification model.
- Accessing the data items may include accessing a set of evaluation data items.
- Removing substantially similar items may include removing substantially similar items from the set of evaluation data items to obtain a set of unique evaluation data items.
- Configuring the data item classifier may include determining and setting the classification threshold based on the set of unique evaluation data items.
- the data items may be e-mails, such that, after configuration, the data item classifier may be used to filter out spam e-mail in a set of received e-mails of unknown classification.
- the misclassification costs may depend on varying costs of misclassifying subcategories of non-spam e-mail as spam e-mail.
- Implementations of this aspect may include one or more of the following features. For example, for at least one received data item, a classification output indicative of whether or not the data item belongs to the particular class may be obtained. The threshold value for the classification threshold that reduces misclassification costs may be determined also based on, at least in part, the classification output of the at least one data item.
- Obtaining a classification output for the at least one data item may include obtaining feature data for the data item by determining whether the data item has a predefined set of features; inputting the feature data into a probabilistic classifier to obtain a probability measure; and producing a classification output based on the probability measure.
- a class indication for the at least one data item may be received; and the value for the classification threshold that reduces misclassification costs may be determined also based on, at least in part, the class indication of the at least one data item.
- Determining a threshold value that reduces misclassification costs may include determining a threshold value that minimizes misclassification costs.
- the data items may be e-mails and the particular class is spam, such that the data item classifier is used to filter out spam e-mail in a set of received e-mails of unknown classification.
- the misclassification costs may depend on varying costs of misclassifying subcategories of non-spam e-mail as spam e-mail.
- an e-mail classifier determines whether at least one received e-mail should be classified as spam.
- the e-mail classifier includes a feature analyzer, a scoring classifier, a threshold comparator, a grouper, and a threshold selector.
- the feature analyzer obtains feature data for the e-mail by determining whether the e-mail has a predefined set of features.
- the scoring classifier provides a classification output indicative of whether or not the e-mail is spam based on the feature data, wherein the scoring classifier is trained using a set of unique training e-mails.
- the threshold comparator compares the classification output to a classification threshold and the e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the e-mail is spam.
- Implementations of this aspect may include one or more of the following features.
- the threshold selector may select and set a new value for the classification threshold also based on a classification output for at least one e-mail in the set of received e-mails.
- the threshold selector also may select and set a new value for the classification threshold based on a class indication for the at least one e-mail.
- In another aspect, an e-mail server includes an e-mail classifier and a mail handler.
- the e-mail classifier has a classification threshold that is adjusted according to similarity rates of unique e-mails received in an incoming e-mail stream.
- the e-mail classifier is configured to classify e-mails in an incoming e-mail stream as spam or legitimate based on a comparison of a classification output to a classification threshold, where the classification output is indicative of whether the incoming e-mail is spam.
- the mail handler handles an e-mail based on the class given to the e-mail by the e-mail classifier.
- Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
- FIG. 1 illustrates an exemplary networked computing environment that supports e-mail communications and in which spam filtering may be performed.
- FIG. 2 is a high-level functional block diagram of an e-mail server program that may execute on an e-mail server to provide large-scale spam filtering.
- FIG. 3 is a functional block diagram of a probabilistic e-mail classifier.
- FIGS. 4A and 4B collectively provide a flow chart illustrating the process by which the probabilistic e-mail classifier of FIG. 3 is trained.
- FIG. 6 is a flow chart illustrating the process by which the probabilistic e-mail classifier of FIG. 3 classifies incoming e-mail.
- FIG. 7 is a flow chart illustrating the process by which the classification threshold of the probabilistic classifier of FIG. 3 is adjusted during operation.
- a classifier is used to classify data items of unknown classification in a data stream.
- the classifier classifies a data item by determining a measure indicative of a degree of correlation between the data item and a particular class, producing a classification output based on the measure, and comparing the classification output to a classification threshold.
- the adjustment of the classification threshold to reduce misclassification costs can be performed using the information regarding actual similarity rate regardless of whether the classifier has been trained using unique data items.
- these techniques may be applied to other classification problems in which the similarity rate in the training and/or evaluation data is not likely to reflect the similarity rate experienced when classifying unknown data items, but in which information about the similarity rate during operation can be obtained.
- FIG. 1 illustrates an exemplary networked computing environment 100 that supports e-mail communications and in which spam filtering may be performed.
- Computer users are distributed geographically and communicate using client systems 110 a and 110 b .
- Client systems 110 a and 110 b are connected to ISP networks 120 a and 120 b , respectively. While illustrated as ISP networks, networks 120 a or 120 b may be any network, e.g., a corporate network.
- Clients 110 a and 110 b may be connected to the respective ISP networks 120 a and 120 b through various communication mediums, such as a modem connected to a telephone line (using, for example, serial line internet protocol (SLIP) or point-to-point protocol (PPP)) or a direct network connection (using, for example, transmission control protocol/internet protocol (TCP/IP)), a wireless Metropolitan Network, or a corporate local area network (LAN).
- E-mail servers 130 a and 130 b also are connected to ISP networks 120 a and 120 b , respectively.
- ISP networks 120 a and 120 b are connected to a global network 140 , e.g. the Internet, such that a device on one ISP network can communicate with a device on the other ISP network.
- For simplicity, only two ISP networks, 120 a and 120 b , have been illustrated as being connected to Internet 140 . However, there may be a great number of such ISP networks connected to Internet 140 . Likewise, each ISP network may have many e-mail servers and many client systems connected to the ISP network.
- Each of the client systems 110 a and 110 b and e-mail servers 130 a and 130 b may be implemented using, for example, a general-purpose computer capable of responding to and executing instructions in a defined manner, a personal computer, a special-purpose computer, a workstation, a server, a device, a component, or other equipment or some combination thereof capable of responding to and executing instructions.
- Client systems 110 a and 110 b and e-mail servers 130 a and 130 b may receive instructions from, for example, a software application, a program, a piece of code, a device, a computer, a computer system, or a combination thereof, which independently or collectively direct operations.
- These instructions may take the form of one or more communications programs that facilitate communications between the users of client systems 110 a and 110 b .
- Such communications programs may include, for example, e-mail programs, IM programs, file transfer protocol (FTP) programs, or voice-over-IP (VoIP) programs.
- the instructions may be embodied permanently or temporarily in any type of machine, component, equipment, storage medium, or propagated signal that is capable of being delivered to a client system 110 a and 110 b or the e-mail servers 130 a and 130 b.
- Examples of ISP networks 120 a and 120 b include Wide Area Networks (WANs), Local Area Networks (LANs), analog or digital wired and wireless telephone networks (e.g., a Public Switched Telephone Network (PSTN), an Integrated Services Digital Network (ISDN), or a Digital Subscriber Line (xDSL)), or any other wired or wireless network.
- Networks 120 a and 120 b may include multiple networks or subnetworks, each of which may include, for example, a wired or wireless data pathway.
- E-mail server 130 a or 130 b may handle e-mail for many thousands of e-mail users (if not more) connected to ISP network 120 a or 120 b .
- E-mail server 130 a or 130 b may handle e-mail for a single e-mail domain (e.g., aol.com) or for multiple e-mail domains.
- E-mail server 130 a or 130 b may be composed of multiple, interconnected computers working together to provide e-mail service for e-mail users of ISP network 120 a or 120 b .
- An e-mail user such as a user of client system 110 a or 110 b may have one or more e-mail accounts on e-mail server 130 a or 130 b . Each account corresponds to an e-mail address. Each account may have one or more folders in which e-mail is stored. E-mail sent to one of the e-mail user's e-mail addresses is routed to e-mail server 130 a or 130 b and placed in the account that corresponds to the e-mail address to which the e-mail was sent. The e-mail user then uses, for example, an e-mail client program executing on client system 110 a or 110 b to retrieve the e-mail from e-mail server 130 a or 130 b and view the e-mail.
- the e-mail client program may be, for example, a stand-alone e-mail application such as Microsoft Outlook or an e-mail client application that is integrated with an ISP's client for accessing the ISP's network, such as America Online (AOL) Mail, which is part of the AOL client.
- the e-mail client program also may be, for example, a web browser that accesses web-based e-mail services.
- the e-mail client programs executing on client systems 110 a and 110 b also may allow one of the users to send e-mail to an e-mail address.
- the e-mail client program executing on client system 110 a allows the e-mail user of client system 110 a (the sending user) to compose an e-mail message and address it to a recipient address, such as an e-mail address of the user of client system 110 b .
- the sending user indicates that an e-mail is to be sent to the recipient address
- the e-mail client program executing on client system 110 a communicates with e-mail server 130 a to handle the transmission of the e-mail to the recipient address.
- For an e-mail addressed to an e-mail user of client system 110 b , for example, e-mail server 130 a sends the e-mail to e-mail server 130 b .
- E-mail server 130 b receives the e-mail and places it in the account that corresponds to the recipient address. The user of client system 110 b then may retrieve the e-mail from e-mail server 130 b , as described above.
- a spammer typically uses an e-mail client or server program to send similar spam e-mails to hundreds, if not millions, of e-mail recipients.
- a spammer may target hundreds of recipient e-mail addresses serviced by e-mail server 130 b on ISP network 120 b .
- the spammer may maintain the list of targeted recipient addresses as a distribution list.
- the spammer may use the e-mail program to compose a spam e-mail and instruct the e-mail client program to use the distribution list to send the spam e-mail to the recipient addresses.
- the e-mail then is sent to e-mail server 130 b for delivery to the recipient addresses.
- FIG. 2 is a high-level functional block diagram of an e-mail server program 230 that may execute on an e-mail server, such as e-mail server 130 a or 130 b , to provide large-scale spam filtering.
- E-mail server program 230 includes a probabilistic e-mail classifier 232 and a mail handler 234 . During operation, the incoming e-mail arriving at e-mail server program 230 passes through probabilistic e-mail classifier 232 .
- e-mail classifier 232 makes the determination of whether or not an e-mail is spam by analyzing the e-mail to determine a confidence level or probability measure that the e-mail is spam, and comparing the probability measure to a threshold. If the probability measure is above a certain threshold, the e-mail is labeled as spam.
- Because classifier 232 is a probabilistic classifier, there is a chance that a spam e-mail will be misclassified as legitimate and that a legitimate e-mail will be classified as spam. There are generally costs associated with such misclassifications. For the e-mail service provider, misclassifying spam e-mail as legitimate results in additional storage costs, which might become fairly substantial. In addition, failure to adequately block spam may result in dissatisfied customers, which may result in the customers abandoning the service. The cost of misclassifying spam as legitimate, however, may generally be considered nominal when compared to the cost of misclassifying legitimate e-mail as spam, particularly when the policy is to delete or otherwise block the delivery of spam e-mail to the e-mail user. Losing an important e-mail may mean more to a customer than mere annoyance. Cost, therefore, may take into account factors other than just monetary terms.
- Among legitimate e-mails, misclassifying personal e-mails may incur higher costs than misclassifying work related e-mails.
- misclassifying work related e-mails might incur higher costs than misclassifying e-commerce related e-mails, such as order or shipping confirmations.
- Before classifier 232 is used to classify incoming e-mail, classifier 232 is trained and the threshold is set to minimize such misclassification costs. Classifier 232 is trained using a training set of e-mail to develop an internal model that allows classifier 232 to determine a probability measure for unknown e-mail. Evaluation e-mail then is used to set the initial classification threshold of classifier 232 such that misclassification costs are minimized. In some implementations, misclassification costs also may be taken into account during classifier training. In these implementations, evaluation e-mail is used in the same manner to set the initial threshold, but the resulting threshold may differ because training that accounts for misclassification costs yields different probability measures than training that does not.
- substantially similar e-mails are removed from both the training set and the evaluation set to prevent the classifier 232 from being improperly biased based on a rate or quantity of substantially similar e-mails in the training and evaluation sets that do not reflect the rate or quantity that would occur during classification.
- E-mail systems tend to be used by any given spammer to send the same or similar spam e-mail to a large number of recipients during a relatively short period of time. While the content of each e-mail is essentially the same, it normally varies to a degree. For example, mass e-mailings often are personalized by addressing the recipient user by their first/last name, or by including in the e-mail message body the recipient user's account number or zip code.
- spammers may purposefully randomize their e-mails so as to foil conventional spam detection schemes, such as those based on matching exact textual strings in the e-mail.
- the core of the e-mail remains the same, with random or neutral text added, often confusing such “exact-match” spam filters.
- the extra text may be inserted in such a way that it is not immediately visible to the users (e.g., when the font has the same color as the background).
- Other randomization strategies of spammers include: appending random character strings to the subject line of the e-mail, changing the order of paragraphs, or randomizing the non-alphanumeric content.
- spammers are likely to send large numbers of substantially similar e-mails, which may include the slight and purposefully introduced variations mentioned above and which, therefore, are not truly identical.
- One characteristic of spam e-mail is that essentially the same content tends to be sent in high volume.
- a measure of the number of substantially similar copies of a particular e-mail in an e-mail stream provides a good indicator of whether that e-mail is spam or not.
- the training and evaluation sets may not be representative of the actual duplication rate of e-mails examined during classification, due in part to a likelihood of a sample selection bias in the training and evaluation sets.
- personal e-mails may be hard to obtain because of privacy concerns, while spam and bulk mail may be easily obtained for training and initial threshold setting.
- the same e-mail may be duplicated a number of times in the collected sample, but the similarity rate or message multiplicity may not reflect the actual similarity rate that will occur during classification.
- the similarity rate may change during classification simply because of spammers changing their e-mail and/or the rate at which they are sending e-mails.
- Such an inaccurate reflection of the actual similarity rates on the part of the training and evaluation e-mails may improperly bias a classifier during training (both through feature selection and in developing a classification model) and when setting the initial classification threshold.
- This potentially improper bias of classifier 232 is avoided by removing substantially similar e-mails from the training and evaluation sets and by adjusting the classifier threshold periodically or otherwise to account for the actual similarity rates of incoming e-mails.
- the duplication rate of incoming e-mails is determined for the arriving e-mails.
- This empirical information, information about the class of some of the e-mails in the incoming stream (e.g., obtained from user complaints), and the classification outputs for the incoming e-mails during the period are used to adjust the threshold. Consequently, the classification threshold is adjusted during operation to account for the actual similarity rate of e-mails in the incoming stream.
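The passage above names the inputs to the threshold adjustment (similarity rates, class indications for some e-mails, and classification outputs) but not how they are combined. The following is a minimal sketch of one possible combination, assuming that each unique e-mail with a known class contributes to the total cost in proportion to the number of substantially similar copies observed in the stream. The function name `adjust_threshold`, the tuple layout, and the weighting rule are illustrative assumptions, not details taken from the source.

```python
def adjust_threshold(observations, fp_cost):
    """Pick the threshold that minimizes copy-weighted misclassification cost.

    observations: list of (output, is_spam, copies) tuples, one per unique
        incoming e-mail whose class is known (e.g., from user complaints).
        `copies` is the number of substantially similar e-mails seen.
    fp_cost: assumed cost of labeling one legitimate e-mail as spam, relative
        to a cost of 1 for labeling one spam e-mail as legitimate.
    An e-mail is labeled spam when its output is strictly above the threshold.
    """
    candidates = [float("-inf")] + sorted({out for out, _, _ in observations})
    best_threshold, best_cost = None, None
    for threshold in candidates:
        cost = 0.0
        for output, is_spam, copies in observations:
            labeled_spam = output > threshold
            if is_spam and not labeled_spam:
                cost += copies * 1.0      # missed spam, once per copy
            elif not is_spam and labeled_spam:
                cost += copies * fp_cost  # blocked legitimate mail, per copy
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost
```

Under this assumption, a spam campaign that arrives in thousands of copies pulls the threshold down far more than a single spam e-mail would, which is the kind of effect that removing substantially similar e-mails from the training and evaluation sets would otherwise hide.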
- FIG. 3 is a functional block diagram of one implementation of probabilistic e-mail classifier 232 .
- E-mail classifier 232 includes a grouper 320 , a feature analyzer 330 , a feature reducer 340 , a probabilistic classifier 350 , a threshold selector 360 , a threshold comparator 370 , and a mail labeler 380 .
- the various components of e-mail classifier 232 generally function and cooperate during three phases: training, optimization, and classification. To simplify an understanding of the operation of e-mail classifier 232 during each phase, the data flow between the various e-mail classifier 232 components is shown separately for each phase.
- a non-broken line is shown for data flow during the training phase, a line broken at regular intervals (i.e., dotted) indicates data flow during the initial threshold setting phase, and a broken line with alternating long and short dashed lines indicates the data flow during classification.
- a set of t e-mails (the “training e-mails”) having a known classification (i.e. known as spam or legitimate) are accessed ( 410 ) and used to train classifier 232 .
- substantially similar e-mails are removed from the set of t training e-mails to obtain a reduced set of m unique training e-mails ( 420 ).
- Each e-mail in the unique set of m training e-mails then is analyzed to obtain the n features (described further below) of the unique set of training e-mails ( 430 ) and to form an n-by-m feature matrix ( 440 ).
- feature selection is performed to select N features of the n feature set, where N < n ( 450 ), and the n-by-m feature matrix is reduced accordingly to an N-by-m reduced feature matrix ( 460 ).
- the N-by-m reduced feature matrix is used along with the known classification of the unique training e-mails to obtain an internal classification model ( 470 ).
- a set of t training e-mails 310 a is input into classifier 232 and applied to grouper 320 .
- Grouper 320 detects substantially similar e-mails in the set of t training e-mails 310 a , groups the detected substantially similar e-mails, and selects a representative e-mail from within each group to form a reduced set of m unique training e-mails ( 410 and 420 ).
- Grouper 320 may be implemented using known or future techniques for detecting substantially similar or duplicate documents that may or may not match exactly.
- grouper 320 may be implemented using the I-Match approach, described in Chowdhury et al., “Collection Statistics For Fast Duplicate Document Detection,” ACM Transactions on Information Systems, 20(2):171-191, 2002.
- the I-Match approach produces a single hash representation of a document and guarantees that a single document will map to one and only one cluster, while still providing for non-exact matching.
- Each document is reduced to a feature vector and term collection statistics are used to produce a binary feature selection-filtering agent.
- the filtered feature vector then is hashed to a single value for all documents that produced the identical filtered feature vector, thus producing an efficient mechanism for duplicate detection.
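The steps just described (reduce each document to a feature vector, filter it using collection statistics, hash what remains) can be sketched as follows. This is a simplified illustration, not the published I-Match implementation: the use of inverse document frequency as the collection statistic, the `[lo, hi]` window, SHA-1 as the hash, and all function names are assumptions made for the example.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    """Collection statistic: inverse document frequency of every token."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {tok: math.log(len(docs) / df) for tok, df in doc_freq.items()}


def imatch_signature(text, idf, lo, hi):
    """Hash the sorted set of tokens whose idf falls inside [lo, hi]."""
    kept = sorted({t for t in tokenize(text) if lo <= idf.get(t, -1.0) <= hi})
    if not kept:
        # Assumption: if the filter removes every token, fall back to the
        # full token sequence so unrelated documents do not collide.
        kept = tokenize(text)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def unique_representatives(docs, lo, hi):
    """Group documents by signature and keep the first member of each group."""
    idf = idf_table(docs)
    groups = defaultdict(list)
    for i, doc in enumerate(docs):
        groups[imatch_signature(doc, idf, lo, hi)].append(i)
    representatives = [docs[members[0]] for members in groups.values()]
    return representatives, dict(groups)
```

Because rare tokens (such as a recipient's name inserted for personalization) fall outside the window, two copies of a mass mailing that differ only in such tokens produce the same signature, while each document still maps to exactly one group.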
- Other similarity detection approaches may be used.
- current similarity or duplication detection techniques can be roughly classed as similarity-based techniques or fingerprint-based techniques.
- In similarity-based techniques, two documents are considered identical if their distance (according to a measure such as the cosine distance) falls below a certain threshold.
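As a rough illustration of a similarity-based test, the sketch below computes the cosine distance between term-count vectors of two documents and compares it to a cutoff. The tokenization, the use of raw term counts, the function names, and the default cutoff of 0.3 are assumptions chosen for the example and are not specified in the source.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def substantially_similar(x, y, max_distance=0.3):
    """True when the distance between the two documents is below the cutoff."""
    return cosine_distance(term_vector(x), term_vector(y)) < max_distance
```

Unlike a fingerprint, this test requires a pairwise comparison, so it tends to be more expensive on large collections.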
- Some similarity-based techniques are described in C. Buckley et al., The Smart/Empire Tipster IR System , in TIPSTER Phase III Proceedings, Morgan Kaufmann, 2000; T. C. Hoad & J. Zobel, Methods of Identifying Versioned and Plagarised Documents , Journal of the American Society for Information Science and Technology, 2002; and M. Sanderson, Duplicate Detection in the Reuters Collection , Tech.
- the set of m unique training e-mails are passed to feature analyzer 330 ( 430 ).
- feature analyzer 330 analyzes the set of m unique training e-mails to determine the n features of that set (the “feature set”).
- the feature set may be composed of text and non-text features.
- Text features generally include the text in the bodies and subject lines of the e-mails.
- Non-text features may include various other attributes of the e-mails, such as formatting attributes (e.g., all caps), address attributes (e.g., multiple addressees or from a specific e-mail address), or other features of an e-mail message such as whether there is an attachment or image, audio or video features embedded in the e-mail.
- Feature analyzer 330 includes a text analyzer 330 b and a non-text analyzer 330 a .
- text analyzer 330 b identifies text features of each e-mail message in the set of m unique training e-mails.
- Text analyzer 330 b may tokenize each e-mail to determine the text features.
- a token is a textual component in the body or subject line and, for example, may be defined as letters separated from another set of letters by whitespace or punctuation. Text analyzer 330 b keeps track of tokens and e-mails within which the tokens occur.
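A minimal sketch of this tokenization, assuming a token is a maximal run of letters and that tokens are lower-cased before being recorded (the lower-casing and the names `tokenize` and `token_index` are assumptions for illustration):

```python
import re
from collections import defaultdict

# A token is taken to be a run of letters; anything else acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def tokenize(text):
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def token_index(emails):
    """Map each token to the set of e-mail positions in which it occurs.

    emails: list of (subject, body) string pairs.
    """
    index = defaultdict(set)
    for position, (subject, body) in enumerate(emails):
        for tok in tokenize(subject + " " + body):
            index[tok].add(position)
    return index
```

The resulting index records, for every token, the e-mails within which it occurs, which is the information needed later to fill in the feature matrix.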
- Non-text analyzer 330 a determines whether each non-text feature is present in each e-mail.
- the exact non-text features for which each e-mail is analyzed typically is a matter of design and empirical judgment.
- For each non-text feature a binary value is generated, indicating whether the feature is present or not.
- Feature analyzer 330 creates a sparse n-by-m feature matrix (where n is the total number of text and non-text features) from the results of text analyzer 330 b and non-text analyzer 330 a ( 440 ). Each entry in the matrix is a binary value that indicates whether the n th feature is present in the m th e-mail.
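One way such a sparse binary matrix might be represented is sketched below, with one row per e-mail holding only the column indices of the features that are present. The dictionary layout of an e-mail, the three non-text feature tests, and the `T:`/`NT:` name prefixes are hypothetical choices made for the example.

```python
import re

# Hypothetical non-text features; each maps a name to a presence test.
NON_TEXT_FEATURES = {
    "NT:subject_all_caps": lambda e: e["subject"].isupper(),
    "NT:has_attachment": lambda e: bool(e.get("attachments")),
    "NT:multiple_recipients": lambda e: len(e.get("to", [])) > 1,
}


def email_features(email):
    """Names of all text and non-text features present in one e-mail."""
    text = (email["subject"] + " " + email["body"]).lower()
    features = {"T:" + tok for tok in re.findall(r"[a-z]+", text)}
    features |= {name for name, test in NON_TEXT_FEATURES.items() if test(email)}
    return features


def build_feature_matrix(emails):
    """Return (feature_names, rows).

    rows[i] lists the column indices of the features present in e-mail i,
    which is a sparse encoding of a binary matrix: an index that is absent
    from a row stands for a 0 entry.
    """
    per_email = [email_features(e) for e in emails]
    feature_names = sorted(set().union(*per_email))
    column = {name: j for j, name in enumerate(feature_names)}
    rows = [sorted(column[f] for f in feats) for feats in per_email]
    return feature_names, rows
```

Storing only the indices of present features keeps the matrix small, since any single e-mail contains a tiny fraction of the full feature set.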
- the n-by-m feature matrix is provided to feature reducer 340 , which reduces the n-by-m feature matrix to a sparse N-by-m reduced feature matrix (where N is less than n), using, for example, mutual information ( 450 and 460 ).
- feature reducer 340 selects a reduced set of the n features (the “reduced feature set”) and reduces the size of the feature matrix accordingly.
- Techniques other than mutual information may be used, alternatively or additionally, to implement such feature selection. For example, document frequency thresholding, information gain, term strength, or χ² may be used.
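To make the mutual-information option concrete, the sketch below scores each binary feature by its mutual information with the class label and keeps the N highest-scoring features. Representing each e-mail as a set of feature names, and the function names used here, are assumptions for the example.

```python
import math


def mutual_information(rows, labels, feature):
    """Mutual information (in nats) between feature presence and class label.

    rows: one set of feature names per e-mail.
    labels: the known class of each e-mail, in the same order.
    """
    m = len(rows)
    counts = {}
    for feats, label in zip(rows, labels):
        key = (feature in feats, label)
        counts[key] = counts.get(key, 0) + 1
    mi = 0.0
    for (present, label), joint in counts.items():
        p_joint = joint / m
        p_feature = sum(c for (p, _), c in counts.items() if p == present) / m
        p_label = sum(c for (_, l), c in counts.items() if l == label) / m
        mi += p_joint * math.log(p_joint / (p_feature * p_label))
    return mi


def select_features(rows, labels, n_keep):
    """Return the n_keep features with the highest mutual information."""
    all_features = sorted(set().union(*rows))
    ranked = sorted(
        all_features,
        key=lambda f: mutual_information(rows, labels, f),
        reverse=True,
    )
    return ranked[:n_keep]
```

A feature that appears equally often in both classes scores zero and is dropped first, while a feature that appears in only one class scores highest.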
- some implementations may forego feature selection/reduction and use the n element feature set, i.e., use all of the features from the set of m unique training e-mails.
- the N selected features are communicated to feature analyzer 330 ( 460 ), which analyzes the incoming e-mails during the initial threshold setting phase and the classification phase for the N selected features instead of all of the features in the incoming e-mails.
- the N-by-m reduced feature matrix is input into classifier 350 .
- Each row of the N-by-m reduced feature matrix corresponds to one of the unique training e-mails and contains data indicating which of the N selected features are present in the corresponding training e-mail.
- Each row of the reduced feature matrix is applied to classifier 350 .
- the known classification of the training e-mail to which the row corresponds also is input.
- probabilistic classifier 350 builds an internal classification model that is used to evaluate future e-mails with unknown classification (i.e., non-training e-mails) ( 470 ).
- Classifier 350 may be implemented using known probabilistic or other classification techniques.
- classifier 350 may be a support vector machine (SVM), a Naïve Bayesian classifier, or a limited dependence Bayesian classifier.
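As an illustration of the Naïve Bayesian option, the sketch below trains a small classifier on binary feature rows with known classes and returns a probability measure for a new e-mail. The Bernoulli event model, the add-one smoothing, and the class and method names are assumptions made for the example; the source does not specify them.

```python
import math


class BernoulliNaiveBayes:
    """Naive Bayes over binary (present/absent) features."""

    def fit(self, rows, labels, features):
        """rows: one set of present features per e-mail; labels: known classes."""
        self.features = list(features)
        self.classes = sorted(set(labels))
        total = len(rows)
        self.log_prior = {}
        self.p_feature = {}
        for cls in self.classes:
            members = [r for r, l in zip(rows, labels) if l == cls]
            self.log_prior[cls] = math.log(len(members) / total)
            # Add-one smoothing keeps every probability strictly in (0, 1).
            self.p_feature[cls] = {
                f: (sum(f in r for r in members) + 1) / (len(members) + 2)
                for f in self.features
            }
        return self

    def probability(self, feats, target="spam"):
        """Posterior probability that an e-mail with `feats` is in `target`."""
        log_joint = {}
        for cls in self.classes:
            score = self.log_prior[cls]
            for f in self.features:
                p = self.p_feature[cls][f]
                score += math.log(p if f in feats else 1.0 - p)
            log_joint[cls] = score
        peak = max(log_joint.values())  # subtract the max for stability
        norm = sum(math.exp(v - peak) for v in log_joint.values())
        return math.exp(log_joint[target] - peak) / norm
```

The value returned by `probability` plays the role of the probability measure that is later turned into a classification output and compared to the classification threshold.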
- Classifier 350 also may be implemented using well-known techniques that account for misclassification costs when constructing the internal model. For example, A. Kolcz and J.
- the classification threshold is initially set using an optimization phase.
- a set of e evaluation e-mails (the “evaluation e-mails”) 310 b having a known classification (i.e. are known to either be spam or legitimate) is accessed ( 510 ) and used to set the initial classification threshold of classifier 232 .
- substantially similar e-mails are removed from the set of e evaluation e-mails to obtain a reduced set of o unique evaluation e-mails ( 520 ).
- Each e-mail in the set of o unique evaluation e-mails then is analyzed to determine whether or not it contains the N features of the reduced feature set ( 530 ). This data is used to obtain a probability measure for the e-mail and a classification output is produced from the probability measure ( 540 ). The classification output for each e-mail in the reduced set of evaluation e-mails is used along with the known classification of each e-mail in the set to obtain an initial threshold value that minimizes the misclassification costs ( 550 ). The classification threshold then is initially set to this value ( 560 ).
- the set of e evaluation e-mails 310 b is input into classifier 232 and applied to grouper 320 ( 510 ).
- Grouper 320 determines groups of substantially similar e-mails in the set of e evaluation e-mails 310 b and selects an e-mail from each group to form a reduced set of o unique evaluation e-mails ( 520 ).
- the N element feature vector for each evaluation e-mail is input into classifier 350 , which applies the internal model to the feature vector to obtain a probability measure that the corresponding e-mail is spam ( 540 ).
- a classification output is produced from this probability measure.
- the classification output, for example, may be the probability measure itself or a linear or non-linear scaled version of the probability measure.
- the classification output is input to threshold selector 360 , along with the corresponding, known classification of the e-mail.
- threshold selector 360 determines the initial threshold ( 550 ).
- threshold selector constructs a Receiver Operating Characteristic (ROC) curve from the classification output and classifications and chooses an operating point on the ROC curve that minimizes misclassification costs.
- the cost of misclassifying a spam e-mail as legitimate is assumed to be one, while cost represents the assigned cost of misclassifying legitimate e-mail as spam e-mail.
- the exact value of this parameter is chosen as a matter of design. For example, a value of 1000 may be chosen. As described further below, some implementations may use values of cost that depend on a legitimate e-mail's subcategory.
- Threshold selector 360 uses the classification outputs and known classifications to determine the threshold value that sets the operation of classifier 232 at a point on the classifier's ROC curve that minimizes L u , i.e. the misclassification costs. For example, threshold selector 360 may evaluate L u for a number of different threshold values and choose the one that minimizes L u .
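- The evaluate-several-thresholds approach can be sketched as follows. This is an empirical stand-in for L u computed on the evaluation set, assuming a cost of one for spam labeled legitimate and an assigned cost for legitimate e-mail labeled spam. The function names, the candidate set, and the convention that ties at the threshold are labeled spam are assumptions made for illustration.

```python
def misclassification_cost(outputs, is_spam, threshold, fp_cost):
    """Total cost on the evaluation set for one candidate threshold."""
    total = 0.0
    for score, spam in zip(outputs, is_spam):
        labeled_spam = score >= threshold
        if spam and not labeled_spam:
            total += 1.0       # spam misclassified as legitimate
        elif labeled_spam and not spam:
            total += fp_cost   # legitimate misclassified as spam
    return total

def select_threshold(outputs, is_spam, fp_cost=1000.0):
    """Pick the candidate threshold with the lowest total cost."""
    # Candidates: every observed output, plus "never label anything as spam".
    candidates = sorted(set(outputs)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: misclassification_cost(outputs, is_spam, t, fp_cost),
    )

# Hypothetical evaluation outputs and known classes.
outputs = [0.1, 0.4, 0.6, 0.9]
is_spam = [False, False, True, True]
initial_threshold = select_threshold(outputs, is_spam)
```

- With a large assigned cost such as 1000, the selected threshold moves upward whenever any legitimate e-mail scores above a spam e-mail, since a single false positive outweighs many missed spam e-mails.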
- threshold comparator 370 uses this threshold during classification to make a decision as to whether an e-mail is spam or not.
- the incoming e-mail stream also is evaluated to periodically determine the similarity rates of the unique e-mails in the incoming e-mail stream ( 710 ).
- the similarity rates, the classification output of each e-mail during the period, and information about the class (e.g., obtained from user complaints) of some of the e-mails in the incoming stream are used to determine a new threshold value that minimizes the misclassification costs, given the similarity rate of the e-mails during the period ( 720 ).
- the classification threshold then is set to this new threshold value ( 730 ).
- the classification threshold of classifier 232 is continually adjusted based on information regarding the similarity rate of e-mails in the e-mail stream.
- incoming e-mails of unknown class 310 c are input into classifier 232 as they arrive at the e-mail server ( 710 ).
- the e-mail is input to grouper 320 to determine if it is substantially similar to an earlier e-mail.
- grouper 320 tracks the number of substantially similar e-mails that occur for each unique e-mail received during the period.
- a copy of each e-mail may be stored by grouper 320 .
- grouper 320 may use the set of stored e-mails to determine the number of substantially similar e-mails that occurred for each unique e-mail in the set (and, consequently, during the period).
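- The counting performed by the grouper can be sketched as below. The passage above does not say how substantial similarity is detected, so this sketch substitutes a simple stand-in: e-mails whose text is identical after lowercasing and stripping punctuation share a signature. The normalization rule and all function names are assumptions.

```python
import hashlib
import re
from collections import Counter

def signature(text):
    """Hash of the text after crude normalization (an assumed similarity test)."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def similarity_counts(emails):
    """Number of e-mails observed for each signature during the period."""
    return Counter(signature(body) for body in emails)

def unique_representatives(emails):
    """Keep the first e-mail seen from each group of similar e-mails."""
    seen = {}
    for body in emails:
        seen.setdefault(signature(body), body)
    return list(seen.values())

# Hypothetical stored e-mails for one period.
emails = ["Buy NOW!!!", "buy now", "Lunch at noon?"]
counts = similarity_counts(emails)
unique = unique_representatives(emails)
```

- The same grouping can serve both purposes described here: reducing a training or evaluation set to unique e-mails, and tracking per-group counts in the incoming stream.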
- an incoming e-mail (whether a duplicate or not) is input to feature analyzer 330 .
- Feature analyzer 330 determines whether or not the incoming e-mail has the N features of the reduced feature set and constructs an N element feature vector ( 610 ).
- the N element feature vector is input into classifier 350 , which applies the internal classification model to the feature vector to obtain a probability measure that the e-mail is spam ( 620 ) and to produce a classification output.
- the classification output is input to threshold selector 360 and threshold comparator 370 .
- Threshold comparator 370 applies the comparison scheme ( 630 ) and produces an output accordingly.
- the output of threshold comparator 370 is applied to mail labeler 380 .
- the incoming e-mail also is input to mail labeler 380 .
- mail labeler 380 labels the incoming e-mail as spam ( 640 ).
- if the output of threshold comparator 370 indicates the classification output is less than the classification threshold, mail labeler 380 labels the incoming e-mail as legitimate ( 650 ).
- the labeled e-mail then is output by mail labeler 380 and sent to mail handler 234 .
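- The classification path (feature analysis, scoring, comparison, labeling) can be sketched in a few lines. Whitespace tokenization, the `score` callable standing in for the trained model, and labeling ties as spam are all assumptions for illustration.

```python
def feature_vector(email_text, selected_features):
    """Binary vector marking which selected features appear in the e-mail."""
    tokens = set(email_text.lower().split())
    return [1 if feature in tokens else 0 for feature in selected_features]

def classify(email_text, selected_features, score, threshold):
    """Label an e-mail by comparing its classification output to the threshold."""
    output = score(feature_vector(email_text, selected_features))
    return "spam" if output >= threshold else "legitimate"

# Hypothetical feature set and a toy scoring function in place of a real model.
features = ["free", "winner", "meeting"]
toy_score = lambda vector: sum(vector[:2]) / 2.0
label_a = classify("FREE prize winner", features, toy_score, 0.5)
label_b = classify("Meeting at noon", features, toy_score, 0.5)
```

- In practice the `score` argument would be the internal classification model, and the label would be passed on to the mail handler.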
- the classification threshold is tuned so as to minimize misclassification costs given the similarity rates in the incoming e-mails ( 720 ).
- grouper 320 determines the number of substantially similar e-mails that occurred for each unique incoming e-mail during a period. This information is provided to threshold selector 360 .
- class labels for the unique e-mails received during the period may be provided to threshold selector 360 .
- These class labels are determined during operation, for example, from customer complaints or reports of spam e-mail to the e-mail service provider. It is highly likely that at least some customers will report high-volume spam e-mails to the e-mail service provider. Thus, a class label of spam is provided to threshold selector 360 for those unique e-mails during the period that have previously been reported as spam. The other unique e-mails are considered legitimate.
- the parameter v represents the similarity rate. Spam is represented by s, while legitimate is represented by l.
- P(x|v) is the probability that the particular e-mail x occurs given the particular similarity rate v.
- P(s|x,v) is the probability that the e-mail x is spam given the particular similarity rate v.
- P(l|x,v) is the probability the particular e-mail x is legitimate given the particular similarity rate v.
- Threshold selector 360 estimates the parameters of L(v) for the period from the number of substantially similar e-mails for each unique e-mail during the period and the supplied class labels. Threshold selector 360 uses the parameter estimates and the classification outputs of all e-mail evaluated during the period to determine the classification threshold value that minimizes L(v) (and consequently L), i.e. the misclassification costs. When evaluating L(v) to determine the minimizing threshold value, as a practical matter and to improve the statistical validity, the evaluation may be done for whole (possibly large) ranges of v rather than carrying it out for every possible value of v. The new threshold value then is provided to threshold comparator 370 for use as the classification threshold to classify incoming e-mail during the next period ( 730 ).
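- A rough empirical analogue of this re-tuning step is sketched below. It is not the L(v) expression itself: it simply weights each unique e-mail's potential error by the number of substantially similar copies seen during the period, and treats unreported e-mails as legitimate. The record layout and function names are assumptions.

```python
def weighted_cost(records, threshold, fp_cost):
    """Cost over a period, with each unique e-mail weighted by its copy count."""
    total = 0.0
    for output, reported_spam, copies in records:
        labeled_spam = output >= threshold
        if reported_spam and not labeled_spam:
            total += copies            # every copy of the spam gets through
        elif labeled_spam and not reported_spam:
            total += copies * fp_cost  # every copy of the legitimate mail is blocked
    return total

def retune_threshold(records, fp_cost=1000.0):
    """Choose the threshold for the next period from this period's records."""
    candidates = sorted({output for output, _, _ in records}) + [float("inf")]
    return min(candidates, key=lambda t: weighted_cost(records, t, fp_cost))

# Hypothetical records: (classification output, reported as spam, copies seen).
records = [
    (0.95, True, 500),
    (0.70, False, 1),
    (0.60, True, 2000),
    (0.20, False, 3),
]
new_threshold = retune_threshold(records)
```

- Because the high-volume spam at 0.60 dominates the period, the lower threshold is preferred here even though it misclassifies one legitimate e-mail. Raising the assigned cost shifts the choice back toward a higher threshold.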
- e-mail classifier 232 may be designed to classify e-mail into more than just those two classes. For instance, e-mail classifier may be designed and trained to classify e-mail not only as legitimate, but to further classify legitimate e-mail into one of a plurality of subcategories of legitimate e-mail. As an example, legitimate mail may have the following subcategories: personal, business related, e-commerce related, mailing list, and promotional. Personal e-mails are those that are exchanged between friends and family. Business related e-mails are generally those that are exchanged between co-workers or current and/or potential business partners.
- E-commerce related e-mails are those that are related to online purchases, such as registration, order, or shipment confirmations.
- Mailing list e-mails are those that relate to e-mail discussion groups to which users may subscribe.
- Promotional e-mails are the commercial e-mails that users have agreed to receive as part of some agreement, such as to view certain content on a web site.
- classifier 232 may be designed to take into account the varying misclassification costs of misclassifying e-mail in a given subcategory of legitimate e-mail as spam. For instance, misclassifying a personal e-mail as spam typically is considered more costly than misclassifying a business related message as spam. But it may be considered more costly to misclassify a business related e-mail as spam than misclassifying a promotional e-mail as spam. These varying misclassification costs may be taken into account both during training and when setting the classification threshold.
- $\text{cost} = \sum_{cat} P(cat \mid l, x)\, C(s, cat)$
- P(cat|l,x) is the probability that a particular legitimate e-mail x belongs to the subcategory cat (e.g., personal, business related, e-commerce related, mailing list, or promotional)
- C(s,cat) is the cost of misclassifying a legitimate e-mail belonging to the subcategory cat as spam.
- $\text{cost}(x, v) = \sum_{cat} P(cat \mid l, x, v)\, C(s, cat)$
- P(cat|l,x,v) is the probability that a particular legitimate e-mail x belongs to the subcategory cat given the duplication rate v
- C(s,cat) is the cost of misclassifying a legitimate e-mail belonging to the subcategory cat as spam.
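- The subcategory-weighted cost is an expectation over subcategories and can be computed directly, as in the short sketch below. The subcategory probabilities and cost values are invented for illustration only, and the function name is hypothetical.

```python
def expected_fp_cost(category_probs, category_costs):
    """Sum over subcategories of P(cat | l, x) * C(s, cat)."""
    return sum(p * category_costs[cat] for cat, p in category_probs.items())

# Hypothetical values: P(cat | l, x) for one e-mail and C(s, cat) per subcategory.
category_probs = {"personal": 0.5, "business": 0.3, "promotional": 0.2}
category_costs = {"personal": 1000.0, "business": 500.0, "promotional": 10.0}
cost_x = expected_fp_cost(category_probs, category_costs)
```

- The result replaces the single assigned cost for that e-mail, so an e-mail that is probably personal carries a higher penalty for being labeled spam than one that is probably promotional.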
- the techniques described above are not limited to any particular hardware or software configuration. Rather, they may be implemented using hardware, software, or a combination of both.
- the methods and processes described may be implemented as computer programs that are executed on programmable computers comprising at least one processor and at least one data storage system.
- the programs may be implemented in a high-level programming language and may also be implemented in assembly or other lower level languages, if desired.
- a threshold could instead be chosen that reduces the misclassification costs to a predetermined level above the minimized cost level.
- the classification threshold has been described as being periodically adjusted during operation, in other implementations the threshold may be adjusted aperiodically. Alternatively, in other implementations, the threshold may be adjusted only a set number of times (e.g., once) during operation. Further, the threshold may be adjusted before or after the classification phase. As another example, a number of places in the foregoing description described an action as performed on each e-mail in a set or each e-mail in an e-mail stream; however, the performance of the actions on each e-mail is not required.
- the foregoing description has described an e-mail classifier that labels mail for handling by a mail handler.
- the e-mail classifier may be designed to handle the e-mail appropriately based on the comparison of the classification output to the classification threshold.
- the e-mail may be marked with the classification output and the mail handler may handle the e-mail differently based on the particular value of the classification output.
- the mail handler may be designed to only act on certain classes of e-mail and just deliver non-labeled e-mail. Thus, only e-mail in certain classes would need to be labeled.
- a classification output tuning function may be used to adjust the algorithm for producing classification outputs from the spam or other class score (e.g., the probability measure) to obtain the same effect as a change in the classification threshold value. To do so, the classification output tuning function may evaluate a number of algorithm adjustments and choose the one that results in minimum misclassification costs.
Abstract
Description
- This application is a divisional of U.S. patent application Ser. No. 10/740,821, filed Dec. 22, 2003, which claims priority to U.S. Provisional Patent Application Ser. No. 60/442,124, filed on Jan. 24, 2003, the entire contents of each of which are hereby incorporated by reference.
- This description relates to classifiers.
- With the advent of the Internet and a decline in computer prices, many people are communicating with one another through computers interconnected by networks. A number of different communications mediums have been developed to facilitate such communications between computer users. A prolific communication medium is electronic mail (e-mail).
- Email participants seem to receive an ever increasing number of mass, unsolicited, commercial e-mailings (colloquially known as e-mail spam or spam e-mail). Spam e-mail is akin to junk mail sent through the postal service. However, because spam e-mail requires neither paper nor postage, the costs incurred by the sender of spam e-mail are quite low when compared to the costs incurred by conventional junk mail senders. Consequently, e-mail users now receive a significant amount of spam e-mail on a daily basis.
- Spam e-mail impacts both e-mail users and e-mail providers. For e-mail users, spam e-mail can be disruptive, annoying, and time consuming. For e-mail and network service providers, spam e-mail represents tangible costs in terms of storage and bandwidth usage, which costs are not negligible due to the large number of spam e-mails being sent.
- In one aspect, a data item classifier is set up by accessing data items of known classification; removing substantially similar items from the data items to identify unique data items; and configuring the data item classifier, based on the unique data items, for future classification of at least one data item of unknown class. The configuring results in the data item classifier being capable of determining a measure that a data item of unknown class belongs to a particular class.
- Implementations of this aspect may include one or more of the following features. For example, accessing the data items may include accessing a set of training data items; removing substantially similar items may include removing substantially similar items from the set of training data items to obtain a set of unique training data items; and configuring the data item classifier may include training a scoring classifier to develop a classification model based on the set of unique training data items, wherein the classification model enables the scoring classifier to determine a measure for at least one data item of unknown classification.
- Training the scoring classifier may include analyzing the set of unique training data items to identify n features in the set of unique training data items and forming an n-by-m feature matrix. Here, m is equal to the number of unique training data items in the set of unique training data items, such that each row of the n-by-m feature matrix corresponds to one of the training data items in the set of training data items and entries in each row of the n-by-m feature matrix indicate which of the n features a corresponding training data item contains. Training the scoring classifier also may include reducing the n-by-m feature matrix to an N-by-m reduced feature matrix, where N is less than n, and inputting each row of the N-by-m reduced feature matrix, along with the known classification of the training data item corresponding to the input row, into the scoring classifier such that the scoring classifier develops the classification model.
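- The feature matrix and its reduction can be sketched as follows, assuming features are whitespace-separated tokens and that the reduced feature set is supplied as a list of column indices. Both assumptions, and the function names, are illustrative only.

```python
def build_feature_matrix(emails):
    """One row per unique training e-mail, one column per feature (token)."""
    vocabulary = sorted({tok for text in emails for tok in text.lower().split()})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for text in emails:
        row = [0] * len(vocabulary)
        for tok in set(text.lower().split()):
            row[index[tok]] = 1  # mark the feature as present
        matrix.append(row)
    return vocabulary, matrix

def reduce_matrix(matrix, selected_columns):
    """Keep only the columns for the N selected features."""
    return [[row[j] for j in selected_columns] for row in matrix]

# Hypothetical unique training e-mails.
emails = ["free money", "money talks"]
vocabulary, matrix = build_feature_matrix(emails)
reduced = reduce_matrix(matrix, [0, 2])
```

- Each row of the reduced matrix, paired with the known class of its e-mail, is what would be input to the scoring classifier.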
- Accessing the data items may include accessing a set of evaluation data items. Removing substantially similar items may include removing substantially similar items from the set of evaluation data items to obtain a set of unique evaluation data items. Configuring the data item classifier may include determining and setting the classification threshold based on the set of unique evaluation data items.
- Determining and setting the classification threshold may include obtaining feature data for at least one of the unique evaluation data items by determining whether the evaluation data item has a predefined set of features; inputting the feature data into the scoring classifier to obtain a classification output for the evaluation data item; determining a threshold value that reduces misclassification costs based on the classification output for the at least one evaluation data item and the known classification of the at least one evaluation data item; and setting the initial value of the classification threshold to the determined threshold value. Determining a threshold value that reduces misclassification costs may include determining a threshold value that minimizes misclassification costs.
- The data items may be e-mails, such that, after configuration, the data item classifier may be used to filter out spam e-mail in a set of received e-mails of unknown classification.
- The misclassification costs may depend on varying costs of misclassifying subcategories of non-spam e-mail as spam e-mail.
- In another aspect, a classification threshold of a data item classifier is adjusted based on received data items. The data item classifier classifies an incoming data item as a member of a particular class when a comparison of a classification output for the incoming data item to the classification threshold indicates the data item belongs to the particular class. To adjust the classification threshold, the similarity rate for unique data items in the received data items is determined. A threshold value for the classification threshold that reduces misclassification costs based on, at least in part, the similarity rate for unique data items in the received data items is determined; and the classification threshold is set to the threshold value.
- Implementations of this aspect may include one or more of the following features. For example, for at least one received data item, a classification output indicative of whether or not the data item belongs to the particular class may be obtained. The threshold value for the classification threshold that reduces misclassification costs may be determined also based on, at least in part, the classification output of the at least one data item.
- Obtaining a classification output for the at least one data item may include obtaining feature data for the data item by determining whether the data item has a predefined set of features; inputting the feature data into a probabilistic classifier to obtain a probability measure; and producing a classification output based on the probability measure.
- A class indication for the at least one data item may be received; and the value for the classification threshold that reduces misclassification costs may be determined also based on, at least in part, the class indication of the at least one data item.
- Determining a value for the classification threshold may include determining a value that minimizes:
where v represents a particular similarity rate, P(x|v) is the probability that the particular data item x occurs given the particular similarity rate v, P(s|x,v) is the probability that the data item x is the particular class given the particular similarity rate v, P(l|x,v) is the probability the particular data item x is not the particular class given the particular similarity rate v, [F(x)=s] is equal to one when an e-mail x is classified as a member of the particular class, zero otherwise, [F(x)=l] is equal to one when an e-mail x is not classified as a member of the particular class, zero otherwise, and cost (x,v) represents an assigned cost of misclassifying data items that are not members of the particular class as members of the particular class.
- Determining a threshold value that reduces misclassification costs may include determining a threshold value that minimizes misclassification costs. The data items may be e-mails and the particular class is spam, such that the data item classifier is used to filter out spam e-mail in a set of received e-mails of unknown classification. The misclassification costs may depend on varying costs of misclassifying subcategories of non-spam e-mail as spam e-mail.
- In another aspect, an e-mail classifier determines whether at least one received e-mail should be classified as spam. The e-mail classifier includes a feature analyzer, a scoring classifier, a threshold comparator, a grouper, and a threshold selector. The feature analyzer obtains feature data for the e-mail by determining whether the e-mail has a predefined set of features. The scoring classifier provides a classification output indicative of whether or not the e-mail is spam based on the feature data, wherein the scoring classifier is trained using a set of unique training e-mails. The threshold comparator compares the classification output to a classification threshold and the e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the e-mail is spam. The grouper periodically determines the similarity rate for unique e-mails in the set of received e-mails. The threshold selector selects and sets a value for the classification threshold. The threshold selector selects and sets an initial value for the classification threshold that reduces misclassification costs based on a unique set of evaluation e-mails and the threshold selector selects and sets a new value for the classification threshold that reduces costs based at least on the similarity rates determined by the duplicate detector.
- Implementations of this aspect may include one or more of the following features. For example, the threshold selector may select and set a new value for the classification threshold also based on a classification output for at least one e-mail in the set of received e-mails. The threshold selector also may select and set a new value for the classification threshold based on a class indication for the at least one e-mail.
- In another aspect, an e-mail server includes an e-mail classifier and a mail handler. The e-mail classifier has a classification threshold that is adjusted according to similarity rates of unique e-mails received in an incoming e-mail stream. The e-mail classifier is configured to classify e-mails in an incoming e-mail stream as spam or legitimate based on a comparison of a classification output to a classification threshold, where the classification output is indicative of whether the incoming e-mail is spam. The mail handler handles an e-mail based on the class given to the e-mail by the e-mail classifier.
- The e-mail classifier may include a feature analyzer, a scoring classifier, a threshold comparator, a grouper, and a threshold selector. The feature analyzer obtains feature data for the e-mail by determining whether the e-mail has a predefined set of features. The scoring classifier provides a classification output indicative of whether or not the e-mail is spam based on the feature data, wherein the scoring classifier is trained using a set of unique training e-mails. The threshold comparator compares the classification output to a classification threshold and the e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the e-mail is spam. The grouper periodically determines the similarity rate for unique e-mails in the set of received e-mails. The threshold selector selects and sets a value for the classification threshold. The threshold selector selects and sets an initial value for the classification threshold that reduces misclassification costs based on a unique set of evaluation e-mails and the threshold selector selects and sets a new value for the classification threshold that reduces costs based at least on the similarity rates determined by the duplicate detector. The threshold selector may select and set a new value for the classification threshold also based on a classification output for at least one e-mail in the set of received e-mails. The threshold selector also may select and set a new value for the classification threshold based on a class indication for the at least one e-mail.
- Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
- The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
- FIG. 1 illustrates an exemplary networked computing environment that supports e-mail communications and in which spam filtering may be performed.
- FIG. 2 is a high-level functional block diagram of an e-mail server program that may execute on an e-mail server to provide large-scale spam filtering.
- FIG. 3 is a functional block diagram of a probabilistic e-mail classifier.
- FIGS. 4A and 4B collectively provide a flow chart illustrating the process by which the probabilistic e-mail classifier of FIG. 3 is trained.
- FIG. 5 is a flow chart illustrating a process by which the initial classification threshold of the probabilistic e-mail classifier of FIG. 3 is set.
- FIG. 6 is a flow chart illustrating the process by which the probabilistic e-mail classifier of FIG. 3 classifies incoming e-mail.
- FIG. 7 is a flow chart illustrating the process by which the classification threshold of the probabilistic classifier of FIG. 3 is adjusted during operation.
- In general, a classifier is used to classify data items of unknown classification in a data stream. The classifier classifies a data item by determining a measure indicative of a degree of correlation between the data item and a particular class, producing a classification output based on the measure, and comparing the classification output to a classification threshold.
- However, before the classifier is used to classify unknown data items, the classifier is configured, typically in at least the following two phases. In the first phase, the classifier is trained using a set of unique training data items (i.e., a data set that does not contain substantially similar data items) of known classification, enabling the classifier to determine a measure for unknown data items. In the second phase, an initial classification threshold is set. Similar to the training in the first phase, the initial classification threshold is determined using a set of unique evaluation data items of known classification, such that misclassification costs are reduced with respect to the set of unique evaluation data items.
- Unique data sets are used in each phase to prevent the classifier from being improperly biased. The classification threshold of the classifier may be biased when the classifier is configured using training and evaluation sets that have similarity rates (i.e., the rate at which substantially similar e-mails are encountered) that do not reflect similarity rates encountered when the classifier is used to classify unknown data items in a data stream. Using properly chosen unique sets of data items helps to remove such bias. Then, when the classifier is used to classify unknown items, information regarding the actual similarity rates of data items in the data stream is obtained and used to adjust the classification threshold such that misclassification costs are reduced given the actual similarity rates.
- The adjustment of the classification threshold to reduce misclassification costs can be performed using the information regarding actual similarity rate regardless of whether the classifier has been trained using unique data items.
- The description that follows primarily describes an application of these techniques to e-mail spam filtering. However, the techniques may be used for spam filtering in other messaging mediums, both text and non-text (e.g., images). For example, these techniques may be used for filtering spam sent using instant messaging, short messaging services (SMS), or Usenet group messaging.
- Moreover, these techniques may be applied to other classification problems in which the similarity rate in the training and/or evaluation data is not likely to reflect the similarity rate experienced when classifying unknown data items, but in which information about the similarity rate during operation can be obtained.
- FIG. 1 illustrates an exemplary networked computing environment 100 that supports e-mail communications and in which spam filtering may be performed. Computer users are distributed geographically and communicate using client systems Client systems ISP networks networks Clients respective ISP networks E-mail servers ISP networks ISP networks global network 140, e.g. the Internet, such that a device on one ISP network can communicate with a device on the other ISP network. For simplicity, only two ISP networks, 120 a and 120 b, have been illustrated as being connected to Internet 140. However, there may be a great number of such ISP networks connected to Internet 140. Likewise, each ISP network may have many e-mail servers and many client systems connected to the ISP network.
- Each of the
client systems e-mail servers Client systems e-mail servers client systems client system e-mail servers - Each
client system e-mail server - Examples of
ISP networks Networks -
E-mail server E-mail server E-mail server ISP network - An e-mail user, such as a user of
client system e-mail server e-mail server client system e-mail server
- The e-mail client program may be, for example, a stand-alone e-mail application such as Microsoft Outlook or an e-mail client application that is integrated with an ISP's client for accessing the ISP's network, such as America Online (AOL) Mail, which is part of the AOL client. The e-mail client program also may be, for example, a web browser that accesses web-based e-mail services.
- The e-mail client programs executing on
client systems client system 110a allows the e-mail user of client system 110a (the sending user) to compose an e-mail message and address it to a recipient address, such as an e-mail address of the user of client system 110b. When the sending user indicates that an e-mail is to be sent to the recipient address, the e-mail client program executing on client system 110a communicates with e-mail server 130a to handle the transmission of the e-mail to the recipient address. For an e-mail addressed to an e-mail user of client system 110b, for example, e-mail server 130a sends the e-mail to e-mail server 130b. E-mail server 130b receives the e-mail and places it in the account that corresponds to the recipient address. The user of client system 110b then may retrieve the e-mail from e-mail server 130b, as described above.
- In an e-mail environment such as that shown, a spammer typically uses an e-mail client or server program to send similar spam e-mails to hundreds, if not millions, of e-mail recipients. For example, a spammer may target hundreds of recipient e-mail addresses serviced by e-mail server 130b on ISP network 120b. The spammer may maintain the list of targeted recipient addresses as a distribution list. The spammer may use the e-mail program to compose a spam e-mail and instruct the e-mail client program to use the distribution list to send the spam e-mail to the recipient addresses. The e-mail then is sent to e-mail server 130b for delivery to the recipient addresses. Thus, in addition to receiving legitimate e-mails (i.e., non-spam e-mails), e-mail server 130b also may receive large quantities of spam e-mail, particularly when many spammers target e-mail addresses serviced by e-mail server 130b.
- FIG. 2 is a high-level functional block diagram of an e-mail server program 230 that may execute on an e-mail server. E-mail server program 230 includes a probabilistic e-mail classifier 232 and a mail handler 234. During operation, the incoming e-mail arriving at e-mail server program 230 passes through probabilistic e-mail classifier 232. E-mail classifier 232 classifies incoming e-mail by making a determination of whether or not a particular e-mail passing through classifier 232 is spam or legitimate e-mail (i.e., non-spam e-mail) and labeling the e-mail accordingly (i.e., as spam or legitimate). E-mail classifier 232 forwards the e-mail to mail handler 234, and mail handler 234 handles the e-mail in a manner that depends on the policies set by the e-mail service provider. For example, mail handler 234 may delete e-mails marked as spam, while delivering e-mails marked as legitimate to an “inbox” folder of the corresponding e-mail account. Alternatively, e-mail labeled as spam may be delivered to a “spam” folder or elsewhere instead of being deleted. The labeled mail may be handled in other ways depending on the policies set by the e-mail service provider.
- As a probabilistic classifier, e-mail classifier 232 makes the determination of whether or not an e-mail is spam by analyzing the e-mail to determine a confidence level or probability measure that the e-mail is spam, and comparing the probability measure to a threshold. If the probability measure is above the threshold, the e-mail is labeled as spam.
- Because classifier 232 is a probabilistic classifier, there is the chance that a spam e-mail will be misclassified as legitimate and that legitimate e-mail will be classified as spam. There are generally costs associated with such misclassifications. For the e-mail service provider, misclassifying spam e-mail as legitimate results in additional storage costs, which might become fairly substantial. In addition, failure to adequately block spam may result in dissatisfied customers, which may result in the customers abandoning the service. The cost of misclassifying spam as legitimate, however, may generally be considered nominal when compared to the cost of misclassifying legitimate e-mail as spam, particularly when the policy is to delete or otherwise block the delivery of spam e-mail to the e-mail user. Losing an important e-mail may mean more to a customer than mere annoyance. Cost, therefore, may take into account factors other than just monetary terms.
- In addition to a variation in misclassification costs between misclassifying spam e-mail as legitimate e-mail and misclassifying legitimate e-mail as spam e-mail, there may be a variation in the costs of misclassifying different categories of legitimate e-mail as spam. For instance, misclassifying personal e-mails may incur higher costs than misclassifying work-related e-mails. Similarly, misclassifying work-related e-mails might incur higher costs than misclassifying e-commerce related e-mails, such as order or shipping confirmations.
- Before classifier 232 is used to classify incoming e-mail, classifier 232 is trained and the threshold is set to minimize such misclassification costs. Classifier 232 is trained using a training set of e-mail to develop an internal model that allows classifier 232 to determine a probability measure for unknown e-mail. Evaluation e-mail then is used to set the initial classification threshold of classifier 232 such that misclassification costs are minimized. In some implementations, misclassification costs also may be taken into account during classifier training. In these implementations, evaluation e-mail is used in the same manner to set the initial threshold, but the resulting threshold may differ because training that accounts for misclassification costs produces different probability measures than training that does not.
- Before training and setting the initial threshold, substantially similar e-mails are removed from both the training set and the evaluation set to prevent classifier 232 from being improperly biased by a rate or quantity of substantially similar e-mails in the training and evaluation sets that does not reflect the rate or quantity that would occur during classification. E-mail systems tend to be used by any given spammer to send the same or similar spam e-mail to a large number of recipients during a relatively short period of time. While the content of each e-mail is essentially the same, it normally varies to a degree. For example, mass e-mailings often are personalized by addressing the recipient user by their first/last name, or by including in the e-mail message body the recipient user's account number or zip code.
- Also, spammers may purposefully randomize their e-mails so as to foil conventional spam detection schemes, such as those based on matching exact textual strings in the e-mail. Usually, the core of the e-mail remains the same, with random or neutral text added, often confusing such “exact-match” spam filters. For instance, the extra text may be inserted in such a way that it is not immediately visible to the users (e.g., when the font has the same color as the background). Other randomization strategies of spammers include appending random character strings to the subject line of the e-mail, changing the order of paragraphs, or randomizing the non-alphanumeric content. Moreover, spammers are likely to send large numbers of substantially similar e-mails, which may include the slight and purposefully introduced variations mentioned above and which, therefore, are not truly identical. One characteristic of spam e-mail is that essentially the same content tends to be sent in high volume. As a result, a measure of the number of substantially similar copies of a particular e-mail in an e-mail stream provides a good indicator of whether that e-mail is spam or not.
- The training and evaluation sets, however, may not be representative of the actual duplication rate of e-mails examined during classification, due in part to a likelihood of a sample selection bias in the training and evaluation sets. For instance, personal e-mails may be hard to obtain because of privacy concerns, while spam and bulk mail may be easily obtained for training and initial threshold setting. In this case, the same e-mail may be duplicated a number of times in the collected sample, but the similarity rate or message multiplicity may not reflect the actual similarity rate that will occur during classification. In addition, the similarity rate may change during classification simply because of spammers changing their e-mail and/or the rate at which they are sending e-mails.
- Such an inaccurate reflection of the actual similarity rates on the part of the training and evaluation e-mails may improperly bias a classifier during training (both through feature selection and in developing a classification model) and when setting the initial classification threshold. This potentially improper bias of classifier 232 is avoided by removing substantially similar e-mails from the training and evaluation sets and by adjusting the classifier threshold periodically or otherwise to account for the actual similarity rates of incoming e-mails.
- Thus, over a period of time, the duplication rate of the incoming e-mails is determined. This empirical information, information about the class of some of the e-mails in the incoming stream (e.g., obtained from user complaints), and the classification outputs for the incoming e-mails during the period are used to adjust the threshold. Consequently, the classification threshold is adjusted during operation to account for the actual similarity rate of e-mails in the incoming stream.
- FIG. 3 is a functional block diagram of one implementation of probabilistic e-mail classifier 232. E-mail classifier 232 includes a grouper 320, a feature analyzer 330, a feature reducer 340, a probabilistic classifier 350, a threshold selector 360, a threshold comparator 370, and a mail labeler 380. The various components of e-mail classifier 232 generally function and cooperate during three phases: training, optimization (i.e., initial threshold setting), and classification. To simplify an understanding of the operation of e-mail classifier 232 during each phase, the data flow between the various e-mail classifier 232 components is shown separately for each phase. A non-broken line is shown for data flow during the training phase, a line broken at regular intervals (i.e., dotted) indicates data flow during the initial threshold setting phase, and a broken line with alternating long and short dashes indicates the data flow during classification.
- Referring to FIG. 4A, in general, during the training phase (i.e., when a classification model is developed) (400) a set of t e-mails (the “training e-mails”) having a known classification (i.e., known to be spam or legitimate) is accessed (410) and used to train classifier 232. To train classifier 232, substantially similar e-mails are removed from the set of t training e-mails to obtain a reduced set of m unique training e-mails (420). Each e-mail in the unique set of m training e-mails then is analyzed to obtain the n features (described further below) of the unique set of training e-mails (430) and to form an n-by-m feature matrix (440). Referring to FIG. 4B, feature selection is performed to select N features of the n feature set, where N<n (450), and the n-by-m feature matrix is reduced accordingly to an N-by-m reduced feature matrix (460). The N-by-m reduced feature matrix is used along with the known classification of the unique training e-mails to obtain an internal classification model (470).
- More particularly, and with reference to the unbroken reference flowpath of FIG. 3, the set of t training e-mails 310a is input into classifier 232 and applied to grouper 320. Grouper 320 detects substantially similar e-mails in the set of t training e-mails 310a, groups the detected substantially similar e-mails, and selects a representative e-mail from within each group to form a reduced set of m unique training e-mails (410 and 420).
- Grouper 320 may be implemented using known or future techniques for detecting substantially similar or duplicate documents that may or may not match exactly. For example, grouper 320 may be implemented using the I-Match approach, described in Chowdhury et al., “Collection Statistics For Fast Duplicate Document Detection,” ACM Transactions on Information Systems, 20(2):171-191, 2002. The I-Match approach produces a single hash representation of a document and guarantees that a single document will map to one and only one cluster, while still providing for non-exact matching. Each document is reduced to a feature vector and term collection statistics are used to produce a binary feature selection-filtering agent. The filtered feature vector then is hashed to a single value, which is the same for all documents that produce the identical filtered feature vector, thus producing an efficient mechanism for duplicate detection.
- Other similarity detection approaches may be used. In general, current similarity or duplication detection techniques can be roughly classed as similarity-based techniques or fingerprint-based techniques. In similarity-based techniques, two documents are considered identical if their distance (according to a measure such as the cosine distance) falls below a certain threshold. Some similarity-based techniques are described in C. Buckley et al., The Smart/Empire Tipster IR System, in TIPSTER Phase III Proceedings, Morgan Kaufmann, 2000; T. C. Hoad & J. Zobel, Methods of Identifying Versioned and Plagiarised Documents, Journal of the American Society for Information Science and Technology, 2002; and M. Sanderson, Duplicate Detection in the Reuters Collection, Tech. Report TR-1997-5, Department of Computing Science, University of Glasgow, 1997. In fingerprint-based techniques, two documents are considered identical if their projections onto a set of attributes are the same. Some fingerprint-based techniques are described in S. Brin et al., Copy Detection Mechanisms for Digital Documents, in Proceedings of SIGMOD, 1995, pp. 398-409; N. Heintze, Scalable Document Fingerprinting, in 1996 USENIX Workshop on Electronic Commerce, November 1996; and Broder, On the Resemblance and Containment of Documents, SEQS: Sequences '91, 1998.
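The fingerprint-style grouping described above can be sketched roughly as follows. This is a simplified illustration in the spirit of I-Match, not the published algorithm: the document-frequency cutoffs (`min_df`, `max_df_ratio`) stand in for the collection-statistics filter, and all function names and sample messages are hypothetical.

```python
import hashlib
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_lexicon(docs, min_df=2, max_df_ratio=0.9):
    """Keep tokens that are neither too rare nor too common (assumed cutoffs)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    limit = max_df_ratio * len(docs)
    return {tok for tok, n in df.items() if min_df <= n <= limit}


def signature(doc, lexicon):
    """Hash the sorted set of lexicon tokens found in the document."""
    kept = sorted(set(tokenize(doc)) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def group_similar(docs, lexicon):
    """Return lists of document indices that share the same signature."""
    groups = defaultdict(list)
    for i, doc in enumerate(docs):
        groups[signature(doc, lexicon)].append(i)
    return list(groups.values())


docs = [
    "Dear Alice, buy cheap pills now xk29",
    "Dear Bob, buy cheap pills now qq71",
    "Dear Carol, buy cheap pills now",
    "Meeting moved to Friday, dear team",
    "Lunch on Friday with the team?",
]
lexicon = build_lexicon(docs)
groups = group_similar(docs, lexicon)
# One representative per group forms the reduced set of unique e-mails.
unique_docs = [docs[g[0]] for g in groups]
```

Because personalized names and random suffixes occur in only one message each, they fall out of the lexicon, so the three spam variants hash to the same signature while the two legitimate messages remain separate.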
- The set of m unique training e-mails is passed to feature analyzer 330 (430). During training, feature analyzer 330 analyzes the set of m unique training e-mails to determine n features of the set of m unique training e-mails (the “feature set”). The feature set may be composed of text and non-text features. Text features generally include the text in the bodies and subject lines of the e-mails. Non-text features may include various other attributes of the e-mails, such as formatting attributes (e.g., all caps), address attributes (e.g., multiple addressees or from a specific e-mail address), or other features of an e-mail message, such as whether there is an attachment or whether image, audio, or video features are embedded in the e-mail.
- Feature analyzer 330 includes a text analyzer 330b and a non-text analyzer 330a. During training, text analyzer 330b identifies text features of each e-mail message in the set of m unique training e-mails. Text analyzer 330b may tokenize each e-mail to determine the text features. A token is a textual component in the body or subject line and, for example, may be defined as letters separated from another set of letters by whitespace or punctuation. Text analyzer 330b keeps track of tokens and the e-mails within which the tokens occur.
- Non-text analyzer 330a determines whether each non-text feature is present in each e-mail. The exact non-text features for which each e-mail is analyzed typically are a matter of design and empirical judgment. For each non-text feature, a binary value is generated, indicating whether the feature is present or not.
- Feature analyzer 330 creates a sparse n-by-m feature matrix (where n is the total number of text and non-text features) from the results of text analyzer 330b and non-text analyzer 330a (440). Each entry in the matrix is a binary value that indicates whether the nth feature is present in the mth e-mail.
- The n-by-m feature matrix is provided to feature reducer 340, which reduces the n-by-m feature matrix to a sparse N-by-m reduced feature matrix (where N is less than n), using, for example, mutual information (450 and 460). In other words, feature reducer 340 selects a reduced set of the n features (the “reduced feature set”) and reduces the size of the feature matrix accordingly. Techniques other than mutual information may be used, alternatively or additionally, to implement such feature selection. For example, document frequency thresholding, information gain, term strength, or χ² may be used. In addition, some implementations may forego feature selection/reduction and use the n element feature set, i.e., use all of the features from the set of m unique training e-mails.
- The N selected features are communicated to feature analyzer 330 (460), which analyzes the incoming e-mails during the initial threshold setting phase and the classification phase for the N selected features instead of all of the features in the incoming e-mails.
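As a rough illustration of the feature analysis and reduction steps above, the sketch below builds a binary n-by-m matrix from text tokens only and keeps the N features with the highest mutual information with the class label. The function names, the whitespace/punctuation tokenizer, and the toy messages are assumptions made for the example; non-text features and sparse storage are omitted.

```python
import math
import re


def tokenize(text):
    """Split on whitespace and punctuation, lowercased."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def feature_matrix(emails):
    """Return (features, matrix); matrix[i][j] is 1 if feature i occurs in e-mail j."""
    token_sets = [set(tokenize(e)) for e in emails]
    features = sorted(set().union(*token_sets))
    matrix = [[1 if f in ts else 0 for ts in token_sets] for f in features]
    return features, matrix


def mutual_information(row, labels):
    """Mutual information, in bits, between a binary feature row and binary labels."""
    m = len(labels)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            joint = sum(1 for x, y in zip(row, labels) if x == f and y == c) / m
            p_f = sum(1 for x in row if x == f) / m
            p_c = sum(1 for y in labels if y == c) / m
            if joint > 0:
                mi += joint * math.log2(joint / (p_f * p_c))
    return mi


def select_features(features, matrix, labels, n_keep):
    """Keep the n_keep rows with the highest mutual information."""
    ranked = sorted(zip(features, matrix),
                    key=lambda fm: -mutual_information(fm[1], labels))[:n_keep]
    return [f for f, _ in ranked], [row for _, row in ranked]


emails = ["cheap pills now", "cheap watches now", "meeting at noon", "lunch at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate
features, matrix = feature_matrix(emails)              # n-by-m matrix
selected, reduced = select_features(features, matrix, labels, n_keep=4)  # N-by-m
```

In this toy set the tokens that appear in every message of one class and in none of the other ("cheap", "now", "at", "noon") score one full bit and are retained.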
- The N-by-m reduced feature matrix is input into classifier 350. Each column of the N-by-m reduced feature matrix corresponds to one of the unique training e-mails and contains data indicating which of the N selected features are present in the corresponding training e-mail. Each column of the reduced feature matrix is applied to classifier 350. As each column is applied to classifier 350, the known classification of the training e-mail to which the column corresponds also is input.
- In response to the N-by-m reduced feature matrix and corresponding classifications, probabilistic classifier 350 builds an internal classification model that is used to evaluate future e-mails with unknown classification (i.e., non-training e-mails) (470). Classifier 350 may be implemented using known probabilistic or other classification techniques. For example, classifier 350 may be a support vector machine (SVM), a Naïve Bayesian classifier, or a limited dependence Bayesian classifier. Classifier 350 also may be implemented using well-known techniques that account for misclassification costs when constructing the internal model. For example, A. Kolcz and J. Alspector, SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs, ICDM-2001 Workshop on Text Mining (TextDM-2001), November 2001, provides a discussion of some techniques for training a probabilistic classifier in a manner that accounts for misclassification costs.
- Once classifier 350 is trained, the classification threshold is initially set using an optimization phase. Referring to FIG. 5, in general, during the optimization phase (500) a set of e evaluation e-mails (the “evaluation e-mails”) 310b having a known classification (i.e., known to be either spam or legitimate) is accessed (510) and used to set the initial classification threshold of classifier 232. To set the initial classification threshold, substantially similar e-mails are removed from the set of e evaluation e-mails to obtain a reduced set of o unique evaluation e-mails (520). Each e-mail in the set of o unique evaluation e-mails then is analyzed to determine whether or not it contains the N features of the reduced feature set (530). This data is used to obtain a probability measure for the e-mail and a classification output is produced from the probability measure (540). The classification output for each e-mail in the reduced set of evaluation e-mails is used along with the known classification of each e-mail in the set to obtain an initial threshold value that minimizes the misclassification costs (550). The classification threshold then is initially set to this value (560).
- In particular, and with reference to the dotted line of FIG. 3, during the initial threshold setting phase, the set of e evaluation e-mails 310b is input into classifier 232 and applied to grouper 320 (510). Grouper 320 determines groups of substantially similar e-mails in the set of e evaluation e-mails 310b and selects an e-mail from each group to form a reduced set of o unique evaluation e-mails (520).
- Each e-mail in the set of o unique evaluation e-mails is input to feature analyzer 330 (530). For each e-mail, feature analyzer 330 determines whether or not the e-mail has the N features of the reduced feature set (determined at 450 in FIG. 4B) and constructs an N element feature vector. Each entry in the N element feature vector is a binary value that indicates whether the Nth feature is present in the e-mail.
- The N element feature vector for each evaluation e-mail is input into classifier 350, which applies the internal model to the feature vector to obtain a probability measure that the corresponding e-mail is spam (540). A classification output is produced from this probability measure. The classification output, for example, may be the probability measure itself or a linear or non-linear scaled version of the probability measure. The classification output is input to threshold selector 360, along with the corresponding, known classification of the e-mail.
- Once a classification output for each e-mail in the reduced set of evaluation e-mails has been obtained and input to threshold selector 360, along with the corresponding classification, threshold selector 360 determines the initial threshold (550). Conceptually, threshold selector 360 constructs a Receiver Operating Characteristic (ROC) curve from the classification outputs and classifications and chooses an operating point on the ROC curve that minimizes misclassification costs.
- The misclassification costs of a given classifier F with respect to a set of unique e-mails can be expressed in one exemplary representation as:
$$L_u = \pi \cdot FP + (1 - \pi) \cdot \text{cost} \cdot FN$$
where the false-positive rate (FP) is:
$$FP = \frac{\sum_{x \in s_u} [F(x) = l]}{|s_u|}$$
and the false-negative rate (FN) is:
$$FN = \frac{\sum_{x \in l_u} [F(x) = s]}{|l_u|}$$
and where π = |s_u|/|E_u|, E is an evaluation set of e-mail, E_u is the set of unique e-mails in set E, s_u is the spam e-mail subset of E_u, and l_u is the legitimate e-mail subset of E_u. [F(x)=s] is equal to one when the classifier returns spam as the class, zero otherwise. [F(x)=l] is equal to one when the classifier classifies an e-mail as legitimate, zero otherwise. Under these definitions, FP measures spam labeled as legitimate and FN measures legitimate e-mail labeled as spam. The cost of misclassifying a spam e-mail as legitimate is assumed to be one, while cost represents the assigned cost of misclassifying legitimate e-mail as spam e-mail. The exact value of this parameter is chosen as a matter of design. For example, a value of 1000 may be chosen. As described further below, some implementations may use values of cost that depend on a legitimate e-mail's subcategory.
- The relationship between FP and FN for a given classifier is known as the Receiver Operating Characteristic. Different choices of the classification threshold for a classifier result in different points along the classifier's ROC curve.
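The cost expression and the choice of an operating point can be sketched as follows. The sketch follows the convention implied by the surrounding definitions (FP is the fraction of spam labeled legitimate, FN the fraction of legitimate e-mail labeled spam) and assumes an e-mail is labeled spam when its classification output is at or above the threshold. The function names, the candidate-threshold scheme, and the sample scores are hypothetical.

```python
def misclassification_cost(scores, labels, threshold, cost=1000.0):
    """L_u = pi*FP + (1 - pi)*cost*FN over unique evaluation e-mails.

    labels: 1 = spam, 0 = legitimate; an output >= threshold is labeled spam.
    """
    spam = [s for s, y in zip(scores, labels) if y == 1]
    legit = [s for s, y in zip(scores, labels) if y == 0]
    pi = len(spam) / len(scores)
    # FP: fraction of spam labeled legitimate (per the definitions above).
    fp = sum(1 for s in spam if s < threshold) / len(spam) if spam else 0.0
    # FN: fraction of legitimate e-mail labeled spam.
    fn = sum(1 for s in legit if s >= threshold) / len(legit) if legit else 0.0
    return pi * fp + (1 - pi) * cost * fn


def select_threshold(scores, labels, cost=1000.0):
    """Evaluate L_u at each observed output (plus 'label nothing spam') and keep the minimizer."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: misclassification_cost(scores, labels, t, cost))


scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.10]  # classification outputs
labels = [1, 1, 0, 1, 0, 0]                    # known classes
initial_threshold = select_threshold(scores, labels, cost=1000.0)
```

With a cost of 1000, the selected threshold sits above the highest-scoring legitimate e-mail, accepting one missed spam in exchange for blocking no legitimate mail.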
- Threshold selector 360 uses the classification outputs and known classifications to determine the threshold value that sets the operation of classifier 232 at a point on the classifier's ROC curve that minimizes L_u, i.e., the misclassification costs. For example, threshold selector 360 may evaluate L_u for a number of different threshold values and choose the one that minimizes L_u.
- Once threshold selector 360 determines the initial threshold value that minimizes the misclassification costs, the threshold value is input to threshold comparator 370 and used as an initial classification threshold (560). Threshold comparator 370 uses this threshold during classification to make a decision as to whether an e-mail is spam or not.
- Once an initial classification threshold is set, classifier 232 is placed in operation to classify incoming e-mail. Referring to FIG. 6, in general, during classification (600) each e-mail in the incoming e-mail stream is analyzed to determine whether or not it contains the N features of the reduced feature set (610). This data is used to obtain a probability measure and classification output for the e-mail (620). The e-mail is classified by comparing the classification output to the classification threshold and labeling the e-mail accordingly. The precise comparison scheme is a matter of choice. As one example, if the classification output is equal to or above the classification threshold (630), the e-mail is labeled as spam (640). If the classification output is below the classification threshold, the e-mail is labeled as legitimate (650).
- The incoming e-mail stream also is evaluated to periodically determine the similarity rates of the unique e-mails in the incoming e-mail stream (710). At the end of each period, the similarity rates, the classification output of each e-mail during the period, and information about the class (e.g., obtained from user complaints) of some of the e-mails in the incoming stream are used to determine a new threshold value that minimizes the misclassification costs, given the similarity rate of the e-mails during the period (720). The classification threshold then is set to this new threshold value (730). Thus, during operation, the classification threshold of classifier 232 is continually adjusted based on information regarding the similarity rate of e-mails in the e-mail stream.
- In particular, and with reference to the long-and-short dashed reference line of FIG. 3, during the classification phase, incoming e-mails of unknown class 310c are input into classifier 232 as they arrive at the e-mail server (710). As an incoming e-mail arrives, the e-mail is input to grouper 320 to determine if it is substantially similar to an earlier e-mail. During a given period, grouper 320 tracks the number of substantially similar e-mails that occur for each unique e-mail received during the period. Alternatively, depending on the techniques used to implement grouper 320, a copy of each e-mail may be stored by grouper 320. At the end of a period, grouper 320 may use the set of stored e-mail to determine the number of substantially similar e-mails that occurred for each unique e-mail in the set (and, consequently, during the period).
- After being processed by grouper 320, an incoming e-mail (whether a duplicate or not) is input to feature analyzer 330. Feature analyzer 330 determines whether or not the incoming e-mail has the N features of the reduced feature set and constructs an N element feature vector (610).
- The N element feature vector is input into classifier 350, which applies the internal classification model to the feature vector to obtain a probability measure that the e-mail is spam (620) and to produce a classification output. The classification output is input to threshold selector 360 and threshold comparator 370.
- Threshold comparator 370 applies the comparison scheme (630) and produces an output accordingly. The output of threshold comparator 370 is applied to mail labeler 380.
- The incoming e-mail also is input to mail labeler 380. When the output of threshold comparator 370 indicates the classification output is equal to or greater than the classification threshold, mail labeler 380 labels the incoming e-mail as spam (640). When the output of threshold comparator 370 indicates the classification output is less than the classification threshold, mail labeler 380 labels the incoming e-mail as legitimate (650). The labeled e-mail then is output by mail labeler 380 and sent to mail handler 234.
- Periodically during the classification phase, the classification threshold is tuned so as to minimize misclassification costs given the similarity rates in the incoming e-mails (720). As described above, grouper 320 determines the number of substantially similar e-mails that occurred for each unique incoming e-mail during a period. This information is provided to threshold selector 360.
- In addition to information on the number of substantially similar e-mails, class labels for the unique e-mails received during the period may be provided to threshold selector 360. These class labels are determined during operation, for example, from customer complaints or reports of spam e-mail to the e-mail service provider. It is highly likely that at least some customers will report high-volume spam e-mails to the e-mail service provider. Thus, a class label of spam is provided to threshold selector 360 for those unique e-mails during the period that have previously been reported as spam. The other unique e-mails are considered legitimate.
- Threshold selector 360 then uses the classification outputs of all e-mail evaluated during the period, the number of substantially similar e-mails for each unique e-mail during the period, and the supplied class labels to determine a new threshold that minimizes the misclassification costs given the known number of substantially similar e-mails present in the e-mail stream.
- The misclassification costs of a given classifier F with respect to a set of e-mails containing substantially similar copies can be expressed in an exemplary implementation as:
- The parameter v represents the similarity rate. Spam is represented by s, while legitimate is represented by l. P(x|v) is the probability that the particular e-mail x occurs given the particular similarity rate v. P(s|x,v) is the probability that the e-mail x is spam given the particular similarity rate v. P(l|x,v) is the probability that the particular e-mail x is legitimate given the particular similarity rate v. [F(x)=s] is equal to one when an e-mail x is classified as spam, zero otherwise. [F(x)=l] is equal to one when an e-mail x is classified as legitimate, zero otherwise. The cost of misclassifying a spam e-mail as legitimate is assumed to be one, while cost(x,v) represents the assigned cost of misclassifying legitimate e-mail as spam e-mail (e.g., 1000). As described above with respect to the optimization phase, the exact value of this parameter is chosen as a matter of design or policy.
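The per-period retuning can be illustrated with a rough sketch. It assumes one plausible empirical estimate of the similarity-aware cost, which is not taken from the text: each unique e-mail is weighted by the number of substantially similar copies seen during the period, a missed spam copy costs one, and a legitimate copy labeled as spam costs `cost`. The tuple layout, the function names `period_cost` and `retune_threshold`, and the sample numbers are hypothetical. Unique e-mails that were not reported as spam are treated as legitimate, as the text describes.

```python
def period_cost(observations, threshold, cost=1000.0):
    """Copy-weighted misclassification cost for one period.

    observations: list of (classification_output, copies, reported_spam)
    with one entry per unique e-mail seen during the period.
    """
    total_copies = sum(copies for _, copies, _ in observations)
    loss = 0.0
    for output, copies, reported_spam in observations:
        labeled_spam = output >= threshold
        if reported_spam and not labeled_spam:
            loss += copies * 1.0   # spam delivered as legitimate
        elif not reported_spam and labeled_spam:
            loss += copies * cost  # legitimate e-mail labeled spam
    return loss / total_copies


def retune_threshold(observations, cost=1000.0):
    """Pick the candidate threshold with the lowest copy-weighted cost."""
    candidates = sorted({output for output, _, _ in observations}) + [float("inf")]
    return min(candidates, key=lambda t: period_cost(observations, t, cost))


observations = [
    (0.97, 500, True),   # high-volume e-mail reported as spam
    (0.80, 300, True),   # second campaign, also reported
    (0.85, 1, False),    # unreported, treated as legitimate
    (0.30, 2, False),
    (0.10, 1, False),
]
strict = retune_threshold(observations, cost=1000.0)
lenient = retune_threshold(observations, cost=100.0)
```

The two calls show how the copy counts interact with the assigned cost: when 300 copies of a spam campaign outweigh the cost of one mislabeled legitimate e-mail, the threshold drops to catch the campaign.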
- Threshold selector 360 estimates the parameters of L(v) for the period from the number of substantially similar e-mails for each unique e-mail during the period and the supplied class labels. Threshold selector 360 uses the parameter estimates and the classification outputs of all e-mail evaluated during the period to determine the classification threshold value that minimizes L(v) (and consequently L), i.e., the misclassification costs. When evaluating L(v) to determine the minimizing threshold value, as a practical matter and to improve the statistical validity, the evaluation may be done for whole (possibly large) ranges of v rather than carrying it out for every possible value of v. The new threshold value then is provided to threshold comparator 370 for use as the classification threshold to classify incoming e-mail during the next period (730).
- While described as classifying e-mail as either spam or legitimate, e-mail classifier 232 may be designed to classify e-mail into more than just those two classes. For instance, e-mail classifier 232 may be designed and trained to classify e-mail not only as legitimate, but to further classify legitimate e-mail into one of a plurality of subcategories of legitimate e-mail. As an example, legitimate mail may have the following subcategories: personal, business related, e-commerce related, mailing list, and promotional. Personal e-mails are those that are exchanged between friends and family. Business related e-mails are generally those that are exchanged between co-workers or current and/or potential business partners. E-commerce related e-mails are those that are related to online purchases, such as registration, order, or shipment confirmations. Mailing list e-mails are those that relate to e-mail discussion groups to which users may subscribe. Promotional e-mails are the commercial e-mails that users have agreed to receive as part of some agreement, such as to view certain content on a web site.
- In addition, whether or not e-mail classifier 232 is specifically designed to classify legitimate e-mail into subcategories, classifier 232 may be designed to take into account the varying misclassification costs of misclassifying e-mail in a given subcategory of legitimate e-mail as spam. For instance, misclassifying a personal e-mail as spam typically is considered more costly than misclassifying a business related message as spam. But it may be considered more costly to misclassify a business related e-mail as spam than to misclassify a promotional e-mail as spam. These varying misclassification costs may be taken into account both during training and when setting the classification threshold.
- Training a classifier to develop a classification model that takes into account such varying misclassification costs generally is known and described in A. Kolcz and J. Alspector, “SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs,” ICDM-2001 Workshop on Text Mining (TextDM-2001), November 2001.
- When setting the initial threshold, such varying costs can be taken into account by setting:
$$\text{cost} = \sum_{cat} P(cat \mid l, x) \cdot C(s, cat)$$
where P(cat|l,x) is the probability that a particular legitimate e-mail x belongs to the subcategory cat (e.g., personal, business related, e-commerce related, mailing list, or promotional) and C(s,cat) is the cost of misclassifying a legitimate e-mail belonging to the subcategory cat as spam.
- Similarly, when the threshold is adjusted during operation, such varying costs can be taken into account by setting:
$$\text{cost}(x, v) = \sum_{cat} P(cat \mid l, x, v) \cdot C(s, cat)$$
where P(cat|l,x,v) is the probability that a particular legitimate e-mail x belongs to the subcategory cat given the duplication rate v and C(s,cat) is the cost of misclassifying a legitimate e-mail belonging to the subcategory cat as spam.
- The following is an exemplary list of subcategories cat and an exemplary cost C(s,cat) that may be used:

| Subcategory cat | Misclassification Cost C(s, cat) |
|---|---|
| Personal | 1000 |
| Business Related | 500 |
| E-commerce related | 100 |
| Mailing List Related | 50 |
| Promotional | 25 |

- The techniques described above are not limited to any particular hardware or software configuration. Rather, they may be implemented using hardware, software, or a combination of both. The methods and processes described may be implemented as computer programs that are executed on programmable computers comprising at least one processor and at least one data storage system. The programs may be implemented in a high-level programming language and may also be implemented in assembly or other lower level languages, if desired.
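As one small example of such a program, the subcategory-dependent cost can be computed as a probability-weighted sum over the exemplary cost list given above. The dictionary keys, the function name `expected_cost`, and the sample subcategory probabilities are assumptions made for illustration; how the subcategory probabilities are estimated is left open.

```python
# Exemplary costs C(s, cat) from the list above.
CATEGORY_COST = {
    "personal": 1000,
    "business related": 500,
    "e-commerce related": 100,
    "mailing list related": 50,
    "promotional": 25,
}


def expected_cost(category_probs, costs=CATEGORY_COST):
    """Sum of P(cat | l, x) * C(s, cat) over the subcategories supplied."""
    return sum(p * costs[cat] for cat, p in category_probs.items())


# Hypothetical subcategory probabilities for one legitimate e-mail x.
probs = {"personal": 0.1, "business related": 0.2, "promotional": 0.7}
cost_x = expected_cost(probs)  # 0.1*1000 + 0.2*500 + 0.7*25
```

The resulting value can take the place of the single fixed cost when the misclassification cost is evaluated for a candidate threshold.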
- Any such program will typically be stored on a computer-usable storage medium or device (e.g., CD-ROM, RAM, or magnetic disk). When read into the processor of the computer and executed, the instructions of the program cause the programmable computer to carry out the various operations described above.
- A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, instead of using thresholds (whether set initially or during operation) that fully minimize the misclassification costs (i.e., reduce the misclassification cost to the minimized cost level), a threshold could instead be chosen that reduces the misclassification costs to a predetermined level above the minimized cost level. Also, while the classification threshold has been described as being periodically adjusted during operation, in other implementations the threshold may be adjusted aperiodically. Alternatively, in other implementations, the threshold may be adjusted only a set number of times (e.g., once) during operation. Further, the threshold may be adjusted before or after the classification phase. As another example, a number of places in the foregoing description described an action as performed on each e-mail in a set or each e-mail in an e-mail stream; however, the performance of the actions on each e-mail is not required.
- As yet another example, the foregoing description has described an e-mail classifier that labels mail for handling by a mail handler. However, in some implementations, it may not be necessary to label e-mail at all. For instance, the e-mail classifier may be designed to handle the e-mail appropriately based on the comparison of the classification output to the classification threshold. Alternatively the e-mail may be marked with the classification output and the mail handler may handle the e-mail differently based on the particular value of the classification output. In other implementations, it may not be necessary to label all classes of e-mail. For example, the mail handler may be designed to only act on certain classes of e-mail and just deliver non-labeled e-mail. Thus, only e-mail in certain classes would need to be labeled.
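- As a rough sketch of handling e-mail directly from the classification output rather than from a label, a handler might branch on the value of the output itself. The function name, the action names, and both cut-off values below are hypothetical and chosen only for illustration.

```python
def handle_by_score(score: float,
                    threshold: float = 0.5,
                    delete_above: float = 0.99) -> str:
    """Choose a handling action from the raw classification output.

    No class label is attached to the message; the action depends only on
    how the output compares with the (assumed) cut-off values.
    """
    if score >= delete_above:
        return "delete"        # output is extreme enough to discard outright
    if score >= threshold:
        return "spam_folder"   # above the classification threshold
    return "inbox"             # below the threshold: deliver normally
```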
- Also, while a binary feature representation is described, one of skill in the art will appreciate that other types of representations may be used. For example, a term frequency-inverse document frequency (tf-idf) representation or a term frequency (tf) representation may be used. Also, for non-text features, non-binary representations may additionally or alternatively be used. For example, if video or audio data is included, the features may include, respectively, color intensity or audio level. In this case, the color intensity or audio level features may be stored in a representation that indicates their levels, not just whether they exist or not (i.e., their analog values may be stored and used).
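- The three text representations mentioned above can be compared with a small sketch. The helper names and the toy corpus are assumptions for illustration, and the idf formula used here, log(N / df), is only one common variant; the description does not specify a particular weighting.

```python
import math
from collections import Counter
from typing import List


def binary_vector(tokens: List[str], vocab: List[str]) -> List[int]:
    """1 if the term occurs in the message at all, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tf_vector(tokens: List[str], vocab: List[str]) -> List[int]:
    """Raw count of each vocabulary term in the message."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_vector(tokens: List[str], vocab: List[str],
                 corpus: List[List[str]]) -> List[float]:
    """Term count weighted by log(N / df), where df is the number of
    corpus messages containing the term (0.0 if the term never occurs)."""
    n_docs = len(corpus)
    counts = Counter(tokens)
    vector = []
    for term in vocab:
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(counts[term] * idf)
    return vector


# Toy corpus of tokenized messages (hypothetical data).
corpus = [["free", "money", "free"], ["meeting", "money"], ["meeting", "notes"]]
vocab = ["free", "meeting", "money", "notes"]
print(binary_vector(corpus[0], vocab))
print(tf_vector(corpus[0], vocab))
print(tfidf_vector(corpus[0], vocab, corpus))
```

The binary vector records only presence, the tf vector keeps the repeated "free", and the tf-idf vector additionally down-weights "money" because it appears in more than one message.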
- As another example, while classifier 350 has been described as a probabilistic classifier, in general, the classifier 350 may be implemented using any techniques (whether probabilistic or deterministic) that develop a spam score (i.e., a score that is indicative of whether an e-mail is likely to be spam or not) or other class score for classifying or otherwise handling an e-mail. Such classifiers are generally referred to herein as scoring classifiers.
- Further, while an implementation that adjusts a classification threshold value has been shown, other implementations may adjust the classification output to achieve the same effect as adjusting the classification threshold, as will be apparent to one of skill in the art. Thus, in other implementations, instead of a threshold selector, a classification output tuning function may be used to adjust the algorithm for producing classification outputs from the spam or other class score (e.g., the probability measure) to obtain the same effect as a change in the classification threshold value. To do so, the classification output tuning function may evaluate a number of algorithm adjustments and choose the one that results in minimum misclassification costs. Because these two techniques obtain the same result, generally terms such as “determining a threshold value for the classification threshold that reduces misclassification costs”; “setting the classification threshold to the threshold value”; and “select and set a value for the classification threshold” should be understood as encompassing both techniques.
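- The equivalence between tuning the classification output and tuning the threshold can be illustrated with a short sketch. The additive shift used here is an assumption chosen for simplicity; the description says only that the algorithm producing the classification output may be adjusted, and the function names are hypothetical.

```python
from typing import Callable


def classify(output: float, threshold: float) -> bool:
    """Flag the item when the classification output meets the threshold."""
    return output >= threshold


def make_output_tuner(fixed_threshold: float,
                      desired_threshold: float) -> Callable[[float], float]:
    """Build an output-tuning function for a fixed threshold.

    Comparing the tuned output against `fixed_threshold` is intended to give
    the same decision as comparing the raw score against
    `desired_threshold`. (With floating-point scores, values lying exactly
    on the boundary may round differently.)
    """
    shift = fixed_threshold - desired_threshold

    def tune(score: float) -> float:
        return score + shift

    return tune


# The threshold stays at 0.5; the outputs are shifted as if it were 0.75.
tune = make_output_tuner(fixed_threshold=0.5, desired_threshold=0.75)
for score in (0.1, 0.6, 0.75, 0.9):
    same = classify(tune(score), 0.5) == classify(score, 0.75)
    print(score, classify(tune(score), 0.5), same)
```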
- Accordingly, implementations other than those specifically described are within the scope of the following claims.
Claims (16)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/380,375 US20060190481A1 (en) | 2003-01-24 | 2006-04-26 | Classifier Tuning Based On Data Similarities |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US44212403P | 2003-01-24 | 2003-01-24 | |
US10/740,821 US7089241B1 (en) | 2003-01-24 | 2003-12-22 | Classifier tuning based on data similarities |
US11/380,375 US20060190481A1 (en) | 2003-01-24 | 2006-04-26 | Classifier Tuning Based On Data Similarities |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/740,821 Division US7089241B1 (en) | 2003-01-24 | 2003-12-22 | Classifier tuning based on data similarities |
Publications (1)
Publication Number | Publication Date |
---|---|
US20060190481A1 (en) | 2006-08-24 |
Family
ID=32829778
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/740,821 Expired - Fee Related US7089241B1 (en) | 2003-01-24 | 2003-12-22 | Classifier tuning based on data similarities |
US11/380,375 Abandoned US20060190481A1 (en) | 2003-01-24 | 2006-04-26 | Classifier Tuning Based On Data Similarities |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/740,821 Expired - Fee Related US7089241B1 (en) | 2003-01-24 | 2003-12-22 | Classifier tuning based on data similarities |
Country Status (2)
Country | Link |
---|---|
US (2) | US7089241B1 (en) |
WO (1) | WO2004068288A2 (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040148330A1 (en) * | 2003-01-24 | 2004-07-29 | Joshua Alspector | Group based spam classification |
US20080103849A1 (en) * | 2006-10-31 | 2008-05-01 | Forman George H | Calculating an aggregate of attribute values associated with plural cases |
US20080126493A1 (en) * | 2006-11-29 | 2008-05-29 | Mcafee, Inc | Scanner-driven email message decomposition |
US7555524B1 (en) * | 2004-09-16 | 2009-06-30 | Symantec Corporation | Bulk electronic message detection by header similarity analysis |
US20090254989A1 (en) * | 2008-04-03 | 2009-10-08 | Microsoft Corporation | Clustering botnet behavior using parameterized models |
US20100082627A1 (en) * | 2008-09-24 | 2010-04-01 | Yahoo! Inc. | Optimization filters for user generated content searches |
US20100185577A1 (en) * | 2009-01-16 | 2010-07-22 | Microsoft Corporation | Object classification using taxonomies |
US20110103682A1 (en) * | 2009-10-29 | 2011-05-05 | Xerox Corporation | Multi-modality classification for one-class classification in social networks |
US7958065B2 (en) | 2008-03-18 | 2011-06-07 | International Business Machines Corporation | Resilient classifier for rule-based system |
US8290203B1 (en) | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8290311B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US20130212047A1 (en) * | 2012-02-10 | 2013-08-15 | International Business Machines Corporation | Multi-tiered approach to e-mail prioritization |
US20130339276A1 (en) * | 2012-02-10 | 2013-12-19 | International Business Machines Corporation | Multi-tiered approach to e-mail prioritization |
US20140172652A1 (en) * | 2012-12-19 | 2014-06-19 | Yahoo! Inc. | Automated categorization of products in a merchant catalog |
US20150039983A1 (en) * | 2007-02-20 | 2015-02-05 | Yahoo! Inc. | System and method for customizing a user interface |
US9037660B2 (en) | 2003-05-09 | 2015-05-19 | Google Inc. | Managing electronic messages |
US9111282B2 (en) * | 2011-03-31 | 2015-08-18 | Google Inc. | Method and system for identifying business records |
US20160055145A1 (en) * | 2014-08-19 | 2016-02-25 | Sandeep Chauhan | Essay manager and automated plagiarism detector |
US9576271B2 (en) | 2003-06-24 | 2017-02-21 | Google Inc. | System and method for community centric resource sharing based on a publishing subscription model |
US10091556B1 (en) * | 2012-12-12 | 2018-10-02 | Imdb.Com, Inc. | Relating items to objects detected in media |
US11140115B1 (en) * | 2014-12-09 | 2021-10-05 | Google Llc | Systems and methods of applying semantic features for machine learning of message categories |
US20220132048A1 (en) * | 2020-10-26 | 2022-04-28 | Genetec Inc. | Systems and methods for producing a privacy-protected video clip |
US20230209115A1 (en) * | 2021-12-28 | 2023-06-29 | The Adt Security Corporation | Video rights management for an in-cabin monitoring system |
Families Citing this family (79)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6643686B1 (en) * | 1998-12-18 | 2003-11-04 | At&T Corp. | System and method for counteracting message filtering |
US8046832B2 (en) | 2002-06-26 | 2011-10-25 | Microsoft Corporation | Spam detector with challenges |
US7676546B2 (en) | 2003-03-25 | 2010-03-09 | Verisign, Inc. | Control and management of electronic messaging |
US20050015626A1 (en) * | 2003-07-15 | 2005-01-20 | Chasin C. Scott | System and method for identifying and filtering junk e-mail messages or spam based on URL content |
US7184160B2 (en) * | 2003-08-08 | 2007-02-27 | Venali, Inc. | Spam fax filter |
US20050065906A1 (en) * | 2003-08-19 | 2005-03-24 | Wizaz K.K. | Method and apparatus for providing feedback for email filtering |
US8886727B1 (en) | 2004-01-27 | 2014-11-11 | Sonicwall, Inc. | Message distribution control |
US20050188040A1 (en) * | 2004-02-02 | 2005-08-25 | Messagegate, Inc. | Electronic message management system with entity risk classification |
US9471712B2 (en) * | 2004-02-09 | 2016-10-18 | Dell Software Inc. | Approximate matching of strings for message filtering |
US7783706B1 (en) * | 2004-04-21 | 2010-08-24 | Aristotle.Net, Inc. | Filtering and managing electronic mail |
US7627670B2 (en) * | 2004-04-29 | 2009-12-01 | International Business Machines Corporation | Method and apparatus for scoring unsolicited e-mail |
EP1761863A4 (en) * | 2004-05-25 | 2009-11-18 | Postini Inc | Electronic message source information reputation system |
US7461063B1 (en) * | 2004-05-26 | 2008-12-02 | Proofpoint, Inc. | Updating logistic regression models using coherent gradient |
US7565445B2 (en) | 2004-06-18 | 2009-07-21 | Fortinet, Inc. | Systems and methods for categorizing network traffic content |
US7680890B1 (en) * | 2004-06-22 | 2010-03-16 | Wei Lin | Fuzzy logic voting method and system for classifying e-mail using inputs from multiple spam classifiers |
US7953814B1 (en) | 2005-02-28 | 2011-05-31 | Mcafee, Inc. | Stopping and remediating outbound messaging abuse |
US8484295B2 (en) | 2004-12-21 | 2013-07-09 | Mcafee, Inc. | Subscriber reputation filtering method for analyzing subscriber activity and detecting account misuse |
US20060031325A1 (en) * | 2004-07-01 | 2006-02-09 | Chih-Wen Cheng | Method for managing email with analyzing mail behavior |
US8037535B2 (en) * | 2004-08-13 | 2011-10-11 | Georgetown University | System and method for detecting malicious executable code |
US7545986B2 (en) * | 2004-09-16 | 2009-06-09 | The United States Of America As Represented By The Secretary Of The Navy | Adaptive resampling classifier method and apparatus |
US9015472B1 (en) | 2005-03-10 | 2015-04-21 | Mcafee, Inc. | Marking electronic messages to indicate human origination |
US8738708B2 (en) | 2004-12-21 | 2014-05-27 | Mcafee, Inc. | Bounce management in a trusted communication network |
US9160755B2 (en) | 2004-12-21 | 2015-10-13 | Mcafee, Inc. | Trusted communication network |
US8370437B2 (en) * | 2004-12-23 | 2013-02-05 | Microsoft Corporation | Method and apparatus to associate a modifiable CRM related token to an email |
US7603422B2 (en) * | 2004-12-27 | 2009-10-13 | Microsoft Corporation | Secure safe sender list |
US7599993B1 (en) * | 2004-12-27 | 2009-10-06 | Microsoft Corporation | Secure safe sender list |
US8165870B2 (en) * | 2005-02-10 | 2012-04-24 | Microsoft Corporation | Classification filter for processing data for creating a language model |
US7975010B1 (en) * | 2005-03-23 | 2011-07-05 | Symantec Corporation | Countering spam through address comparison |
US8201254B1 (en) * | 2005-08-30 | 2012-06-12 | Symantec Corporation | Detection of e-mail threat acceleration |
US7617285B1 (en) * | 2005-09-29 | 2009-11-10 | Symantec Corporation | Adaptive threshold based spam classification |
DE102005046939A1 (en) * | 2005-09-30 | 2007-04-12 | Siemens Ag | Methods and apparatus for preventing the reception of unwanted messages in an IP communication network |
US8065370B2 (en) | 2005-11-03 | 2011-11-22 | Microsoft Corporation | Proofs to filter spam |
US8272064B2 (en) * | 2005-11-16 | 2012-09-18 | The Boeing Company | Automated rule generation for a secure downgrader |
US7769751B1 (en) * | 2006-01-17 | 2010-08-03 | Google Inc. | Method and apparatus for classifying documents based on user inputs |
US7627641B2 (en) * | 2006-03-09 | 2009-12-01 | Watchguard Technologies, Inc. | Method and system for recognizing desired email |
US8364467B1 (en) | 2006-03-31 | 2013-01-29 | Google Inc. | Content-based classification |
US20070260568A1 (en) | 2006-04-21 | 2007-11-08 | International Business Machines Corporation | System and method of mining time-changing data streams using a dynamic rule classifier having low granularity |
US7685201B2 (en) * | 2006-09-08 | 2010-03-23 | Microsoft Corporation | Person disambiguation using name entity extraction-based clustering |
US8224905B2 (en) * | 2006-12-06 | 2012-07-17 | Microsoft Corporation | Spam filtration utilizing sender activity data |
US8554622B2 (en) * | 2006-12-18 | 2013-10-08 | Yahoo! Inc. | Evaluating performance of binary classification systems |
US7562088B2 (en) * | 2006-12-27 | 2009-07-14 | Sap Ag | Structure extraction from unstructured documents |
US8086675B2 (en) | 2007-07-12 | 2011-12-27 | International Business Machines Corporation | Generating a fingerprint of a bit sequence |
US7941437B2 (en) | 2007-08-24 | 2011-05-10 | Symantec Corporation | Bayesian surety check to reduce false positives in filtering of content in non-trained languages |
US7890590B1 (en) | 2007-09-27 | 2011-02-15 | Symantec Corporation | Variable bayesian handicapping to provide adjustable error tolerance level |
US8428367B2 (en) * | 2007-10-26 | 2013-04-23 | International Business Machines Corporation | System and method for electronic document classification |
US9595008B1 (en) | 2007-11-19 | 2017-03-14 | Timothy P. Heikell | Systems, methods, apparatus for evaluating status of computing device user |
US8370930B2 (en) * | 2008-02-28 | 2013-02-05 | Microsoft Corporation | Detecting spam from metafeatures of an email message |
US8849832B2 (en) * | 2008-04-02 | 2014-09-30 | Honeywell International Inc. | Method and system for building a support vector machine binary tree for fast object search |
US8189930B2 (en) * | 2008-07-17 | 2012-05-29 | Xerox Corporation | Categorizer with user-controllable calibration |
US10354229B2 (en) | 2008-08-04 | 2019-07-16 | Mcafee, Llc | Method and system for centralized contact management |
US20100070511A1 (en) * | 2008-09-17 | 2010-03-18 | Microsoft Corporation | Reducing use of randomness in consistent uniform hashing |
US8170966B1 (en) | 2008-11-04 | 2012-05-01 | Bitdefender IPR Management Ltd. | Dynamic streaming message clustering for rapid spam-wave detection |
US8291069B1 (en) | 2008-12-23 | 2012-10-16 | At&T Intellectual Property I, L.P. | Systems, devices, and/or methods for managing sample selection bias |
US8346800B2 (en) * | 2009-04-02 | 2013-01-01 | Microsoft Corporation | Content-based information retrieval |
US9020944B2 (en) * | 2009-10-29 | 2015-04-28 | International Business Machines Corporation | Systems and methods for organizing documented processes |
WO2011081950A1 (en) * | 2009-12-14 | 2011-07-07 | Massachussets Institute Of Technology | Methods, systems and media utilizing ranking techniques in machine learning |
EP2531942A4 (en) * | 2010-02-03 | 2013-10-16 | Arcode Corp | Electronic message systems and methods |
US10185477B1 (en) | 2013-03-15 | 2019-01-22 | Narrative Science Inc. | Method and system for configuring automatic generation of narratives from data |
US10366341B2 (en) | 2011-05-11 | 2019-07-30 | Oath Inc. | Mining email inboxes for suggesting actions |
US9407463B2 (en) | 2011-07-11 | 2016-08-02 | Aol Inc. | Systems and methods for providing a spam database and identifying spam communications |
US8954458B2 (en) * | 2011-07-11 | 2015-02-10 | Aol Inc. | Systems and methods for providing a content item database and identifying content items |
CN102629261B (en) * | 2012-03-01 | 2014-07-16 | 南京邮电大学 | Method for finding landing page from phishing page |
US9146895B2 (en) * | 2012-09-26 | 2015-09-29 | International Business Machines Corporation | Estimating the time until a reply email will be received using a recipient behavior model |
US9235562B1 (en) * | 2012-10-02 | 2016-01-12 | Symantec Corporation | Systems and methods for transparent data loss prevention classifications |
US11470036B2 (en) | 2013-03-14 | 2022-10-11 | Microsoft Technology Licensing, Llc | Email assistant for efficiently managing emails |
US9160680B1 (en) | 2014-11-18 | 2015-10-13 | Kaspersky Lab Zao | System and method for dynamic network resource categorization re-assignment |
US9818066B1 (en) * | 2015-02-17 | 2017-11-14 | Amazon Technologies, Inc. | Automated development and utilization of machine-learning generated classifiers |
RU2634180C1 (en) * | 2016-06-24 | 2017-10-24 | Акционерное общество "Лаборатория Касперского" | System and method for determining spam-containing message by topic of message sent via e-mail |
US10594640B2 (en) * | 2016-12-01 | 2020-03-17 | Oath Inc. | Message classification |
US10673796B2 (en) * | 2017-01-31 | 2020-06-02 | Microsoft Technology Licensing, Llc | Automated email categorization and rule creation for email management |
US10397252B2 (en) | 2017-10-13 | 2019-08-27 | Bank Of America Corporation | Dynamic detection of unauthorized activity in multi-channel system |
US10659483B1 (en) | 2017-10-31 | 2020-05-19 | EMC IP Holding Company LLC | Automated agent for data copies verification |
US10664619B1 (en) * | 2017-10-31 | 2020-05-26 | EMC IP Holding Company LLC | Automated agent for data copies verification |
US10963649B1 (en) | 2018-01-17 | 2021-03-30 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation using an invocable analysis service and configuration-driven analytics |
US11075930B1 (en) | 2018-06-27 | 2021-07-27 | Fireeye, Inc. | System and method for detecting repetitive cybersecurity attacks constituting an email campaign |
US10990767B1 (en) | 2019-01-28 | 2021-04-27 | Narrative Science Inc. | Applied artificial intelligence technology for adaptive natural language understanding |
US10992796B1 (en) | 2020-04-01 | 2021-04-27 | Bank Of America Corporation | System for device customization based on beacon-determined device location |
US11483270B2 (en) * | 2020-11-24 | 2022-10-25 | Oracle International Corporation | Email filtering system for email, delivery systems |
US11381537B1 (en) | 2021-06-11 | 2022-07-05 | Oracle International Corporation | Message transfer agent architecture for email delivery systems |
Citations (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835087A (en) * | 1994-11-29 | 1998-11-10 | Herz; Frederick S. M. | System for generation of object profiles for a system for customized electronic identification of desirable objects |
US6018761A (en) * | 1996-12-11 | 2000-01-25 | The Robert G. Uomini And Louise B. Bidwell Trust | System for adding to electronic mail messages information obtained from sources external to the electronic mail transport process |
US6029195A (en) * | 1994-11-29 | 2000-02-22 | Herz; Frederick S. M. | System for customized electronic identification of desirable objects |
US6073142A (en) * | 1997-06-23 | 2000-06-06 | Park City Group | Automated post office based rule analysis of e-mail messages and other data objects for controlled distribution in network environments |
US6161130A (en) * | 1998-06-23 | 2000-12-12 | Microsoft Corporation | Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set |
US6199103B1 (en) * | 1997-06-24 | 2001-03-06 | Omron Corporation | Electronic mail determination method and system and storage medium |
US6236507B1 (en) * | 1998-04-17 | 2001-05-22 | Zygo Corporation | Apparatus to transform two nonparallel propagating optical beam components into two orthogonally polarized beam components |
US6240424B1 (en) * | 1998-04-22 | 2001-05-29 | Nbc Usa, Inc. | Method and system for similarity-based image classification |
US20010032245A1 (en) * | 1999-12-22 | 2001-10-18 | Nicolas Fodor | Industrial capacity clustered mail server system and method |
US6330590B1 (en) * | 1999-01-05 | 2001-12-11 | William D. Cotten | Preventing delivery of unwanted bulk e-mail |
US6349296B1 (en) * | 1998-03-26 | 2002-02-19 | Altavista Company | Method for clustering closely resembling data objects |
US20020055940A1 (en) * | 2000-11-07 | 2002-05-09 | Charles Elkan | Method and system for selecting documents by measuring document quality |
US20020078441A1 (en) * | 2000-08-31 | 2002-06-20 | Eddie Drake | Real-time audience monitoring, content rating, and content enhancing |
US6421709B1 (en) * | 1997-12-22 | 2002-07-16 | Accepted Marketing, Inc. | E-mail filter and method thereof |
US20020099675A1 (en) * | 2000-04-03 | 2002-07-25 | 3-Dimensional Pharmaceuticals, Inc. | Method, system, and computer program product for representing object relationships in a multidimensional space |
US20020116463A1 (en) * | 2001-02-20 | 2002-08-22 | Hart Matthew Thomas | Unwanted e-mail filtering |
US20020116641A1 (en) * | 2001-02-22 | 2002-08-22 | International Business Machines Corporation | Method and apparatus for providing automatic e-mail filtering based on message semantics, sender's e-mail ID, and user's identity |
US20020147754A1 (en) * | 2001-01-31 | 2002-10-10 | Dempsey Derek M. | Vector difference measures for data classifiers |
US20020181703A1 (en) * | 2001-06-01 | 2002-12-05 | Logan James D. | Methods and apparatus for controlling the transmission and receipt of email messages |
US20020199095A1 (en) * | 1997-07-24 | 2002-12-26 | Jean-Christophe Bandini | Method and system for filtering communication |
US6507866B1 (en) * | 1999-07-19 | 2003-01-14 | At&T Wireless Services, Inc. | E-mail usage pattern detection |
US20030037041A1 (en) * | 1994-11-29 | 2003-02-20 | Pinpoint Incorporated | System for automatic determination of customized prices and promotions |
US20030046421A1 (en) * | 2000-12-12 | 2003-03-06 | Horvitz Eric J. | Controls and displays for acquiring preferences, inspecting behavior, and guiding the learning and decision policies of an adaptive communications prioritization and routing system |
US20030101181A1 (en) * | 2001-11-02 | 2003-05-29 | Khalid Al-Kofahi | Systems, Methods, and software for classifying text from judicial opinions and other documents |
US20030187699A1 (en) * | 2001-12-31 | 2003-10-02 | Bonissone Piero Patrone | System for rule-based insurance underwriting suitable for use by an automated system |
US20040029087A1 (en) * | 2002-08-08 | 2004-02-12 | Rodney White | System and method for training and managing gaming personnel |
US6732149B1 (en) * | 1999-04-09 | 2004-05-04 | International Business Machines Corporation | System and method for hindering undesired transmission or receipt of electronic messages |
US20040128355A1 (en) * | 2002-12-25 | 2004-07-01 | Kuo-Jen Chao | Community-based message classification and self-amending system for a messaging system |
US6901398B1 (en) * | 2001-02-12 | 2005-05-31 | Microsoft Corporation | System and method for constructing and personalizing a universal information classifier |
US7089238B1 (en) * | 2001-06-27 | 2006-08-08 | Inxight Software, Inc. | Method and apparatus for incremental computation of the accuracy of a categorization-by-example system |
US7249175B1 (en) * | 1999-11-23 | 2007-07-24 | Escom Corporation | Method and system for blocking e-mail having a nonexistent sender address |
US7636683B1 (en) * | 1998-09-11 | 2009-12-22 | Ebs Group Limited | Communication of credit filtered prices in an electronic brokerage system |
US7971150B2 (en) * | 2000-09-25 | 2011-06-28 | Telstra New Wave Pty Ltd. | Document categorisation system |
- 2003-12-22: US10/740,821, patent US7089241B1 (not active, Expired - Fee Related)
- 2004-01-23: PCT/US2004/001788, published as WO2004068288A2 (active, Application Filing)
- 2006-04-26: US11/380,375, published as US20060190481A1 (not active, Abandoned)
Patent Citations (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6029195A (en) * | 1994-11-29 | 2000-02-22 | Herz; Frederick S. M. | System for customized electronic identification of desirable objects |
US5835087A (en) * | 1994-11-29 | 1998-11-10 | Herz; Frederick S. M. | System for generation of object profiles for a system for customized electronic identification of desirable objects |
US20030037041A1 (en) * | 1994-11-29 | 2003-02-20 | Pinpoint Incorporated | System for automatic determination of customized prices and promotions |
US6018761A (en) * | 1996-12-11 | 2000-01-25 | The Robert G. Uomini And Louise B. Bidwell Trust | System for adding to electronic mail messages information obtained from sources external to the electronic mail transport process |
US6073142A (en) * | 1997-06-23 | 2000-06-06 | Park City Group | Automated post office based rule analysis of e-mail messages and other data objects for controlled distribution in network environments |
US6199103B1 (en) * | 1997-06-24 | 2001-03-06 | Omron Corporation | Electronic mail determination method and system and storage medium |
US20020199095A1 (en) * | 1997-07-24 | 2002-12-26 | Jean-Christophe Bandini | Method and system for filtering communication |
US6421709B1 (en) * | 1997-12-22 | 2002-07-16 | Accepted Marketing, Inc. | E-mail filter and method thereof |
US6349296B1 (en) * | 1998-03-26 | 2002-02-19 | Altavista Company | Method for clustering closely resembling data objects |
US6236507B1 (en) * | 1998-04-17 | 2001-05-22 | Zygo Corporation | Apparatus to transform two nonparallel propagating optical beam components into two orthogonally polarized beam components |
US6240424B1 (en) * | 1998-04-22 | 2001-05-29 | Nbc Usa, Inc. | Method and system for similarity-based image classification |
US6161130A (en) * | 1998-06-23 | 2000-12-12 | Microsoft Corporation | Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set |
US7636683B1 (en) * | 1998-09-11 | 2009-12-22 | Ebs Group Limited | Communication of credit filtered prices in an electronic brokerage system |
US6330590B1 (en) * | 1999-01-05 | 2001-12-11 | William D. Cotten | Preventing delivery of unwanted bulk e-mail |
US6732149B1 (en) * | 1999-04-09 | 2004-05-04 | International Business Machines Corporation | System and method for hindering undesired transmission or receipt of electronic messages |
US6507866B1 (en) * | 1999-07-19 | 2003-01-14 | At&T Wireless Services, Inc. | E-mail usage pattern detection |
US7249175B1 (en) * | 1999-11-23 | 2007-07-24 | Escom Corporation | Method and system for blocking e-mail having a nonexistent sender address |
US20010032245A1 (en) * | 1999-12-22 | 2001-10-18 | Nicolas Fodor | Industrial capacity clustered mail server system and method |
US20020099675A1 (en) * | 2000-04-03 | 2002-07-25 | 3-Dimensional Pharmaceuticals, Inc. | Method, system, and computer program product for representing object relationships in a multidimensional space |
US20020078441A1 (en) * | 2000-08-31 | 2002-06-20 | Eddie Drake | Real-time audience monitoring, content rating, and content enhancing |
US7971150B2 (en) * | 2000-09-25 | 2011-06-28 | Telstra New Wave Pty Ltd. | Document categorisation system |
US7200606B2 (en) * | 2000-11-07 | 2007-04-03 | The Regents Of The University Of California | Method and system for selecting documents by measuring document quality |
US20020055940A1 (en) * | 2000-11-07 | 2002-05-09 | Charles Elkan | Method and system for selecting documents by measuring document quality |
US20030046421A1 (en) * | 2000-12-12 | 2003-03-06 | Horvitz Eric J. | Controls and displays for acquiring preferences, inspecting behavior, and guiding the learning and decision policies of an adaptive communications prioritization and routing system |
US20020147754A1 (en) * | 2001-01-31 | 2002-10-10 | Dempsey Derek M. | Vector difference measures for data classifiers |
US6901398B1 (en) * | 2001-02-12 | 2005-05-31 | Microsoft Corporation | System and method for constructing and personalizing a universal information classifier |
US20020116463A1 (en) * | 2001-02-20 | 2002-08-22 | Hart Matthew Thomas | Unwanted e-mail filtering |
US20020116641A1 (en) * | 2001-02-22 | 2002-08-22 | International Business Machines Corporation | Method and apparatus for providing automatic e-mail filtering based on message semantics, sender's e-mail ID, and user's identity |
US20020181703A1 (en) * | 2001-06-01 | 2002-12-05 | Logan James D. | Methods and apparatus for controlling the transmission and receipt of email messages |
US7089238B1 (en) * | 2001-06-27 | 2006-08-08 | Inxight Software, Inc. | Method and apparatus for incremental computation of the accuracy of a categorization-by-example system |
US7062498B2 (en) * | 2001-11-02 | 2006-06-13 | Thomson Legal Regulatory Global Ag | Systems, methods, and software for classifying text from judicial opinions and other documents |
US20030101181A1 (en) * | 2001-11-02 | 2003-05-29 | Khalid Al-Kofahi | Systems, Methods, and software for classifying text from judicial opinions and other documents |
US20030187699A1 (en) * | 2001-12-31 | 2003-10-02 | Bonissone Piero Patrone | System for rule-based insurance underwriting suitable for use by an automated system |
US20040029087A1 (en) * | 2002-08-08 | 2004-02-12 | Rodney White | System and method for training and managing gaming personnel |
US20040128355A1 (en) * | 2002-12-25 | 2004-07-01 | Kuo-Jen Chao | Community-based message classification and self-amending system for a messaging system |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040148330A1 (en) * | 2003-01-24 | 2004-07-29 | Joshua Alspector | Group based spam classification |
US7725544B2 (en) | 2003-01-24 | 2010-05-25 | Aol Inc. | Group based spam classification |
US8504627B2 (en) | 2003-01-24 | 2013-08-06 | Bright Sun Technologies | Group based spam classification |
US9037660B2 (en) | 2003-05-09 | 2015-05-19 | Google Inc. | Managing electronic messages |
US9576271B2 (en) | 2003-06-24 | 2017-02-21 | Google Inc. | System and method for community centric resource sharing based on a publishing subscription model |
US7555524B1 (en) * | 2004-09-16 | 2009-06-30 | Symantec Corporation | Bulk electronic message detection by header similarity analysis |
US7831677B1 (en) * | 2004-09-16 | 2010-11-09 | Symantec Corporation | Bulk electronic message detection by header similarity analysis |
US20080103849A1 (en) * | 2006-10-31 | 2008-05-01 | Forman George H | Calculating an aggregate of attribute values associated with plural cases |
US20080126493A1 (en) * | 2006-11-29 | 2008-05-29 | Mcafee, Inc | Scanner-driven email message decomposition |
US8560614B2 (en) * | 2006-11-29 | 2013-10-15 | Mcafee, Inc. | Scanner-driven email message decomposition |
US10095922B2 (en) | 2007-01-11 | 2018-10-09 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8290311B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8290203B1 (en) | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US20150039983A1 (en) * | 2007-02-20 | 2015-02-05 | Yahoo! Inc. | System and method for customizing a user interface |
US7958065B2 (en) | 2008-03-18 | 2011-06-07 | International Business Machines Corporation | Resilient classifier for rule-based system |
US20090254989A1 (en) * | 2008-04-03 | 2009-10-08 | Microsoft Corporation | Clustering botnet behavior using parameterized models |
US8745731B2 (en) * | 2008-04-03 | 2014-06-03 | Microsoft Corporation | Clustering botnet behavior using parameterized models |
US20100082627A1 (en) * | 2008-09-24 | 2010-04-01 | Yahoo! Inc. | Optimization filters for user generated content searches |
US8793249B2 (en) * | 2008-09-24 | 2014-07-29 | Yahoo! Inc. | Optimization filters for user generated content searches |
US20100185577A1 (en) * | 2009-01-16 | 2010-07-22 | Microsoft Corporation | Object classification using taxonomies |
US8275726B2 (en) * | 2009-01-16 | 2012-09-25 | Microsoft Corporation | Object classification using taxonomies |
US20110103682A1 (en) * | 2009-10-29 | 2011-05-05 | Xerox Corporation | Multi-modality classification for one-class classification in social networks |
US8386574B2 (en) * | 2009-10-29 | 2013-02-26 | Xerox Corporation | Multi-modality classification for one-class classification in social networks |
US9111282B2 (en) * | 2011-03-31 | 2015-08-18 | Google Inc. | Method and system for identifying business records |
US20130339276A1 (en) * | 2012-02-10 | 2013-12-19 | International Business Machines Corporation | Multi-tiered approach to e-mail prioritization |
US9256862B2 (en) * | 2012-02-10 | 2016-02-09 | International Business Machines Corporation | Multi-tiered approach to E-mail prioritization |
US9152953B2 (en) * | 2012-02-10 | 2015-10-06 | International Business Machines Corporation | Multi-tiered approach to E-mail prioritization |
US20130212047A1 (en) * | 2012-02-10 | 2013-08-15 | International Business Machines Corporation | Multi-tiered approach to e-mail prioritization |
US10091556B1 (en) * | 2012-12-12 | 2018-10-02 | Imdb.Com, Inc. | Relating items to objects detected in media |
US10528907B2 (en) * | 2012-12-19 | 2020-01-07 | Oath Inc. | Automated categorization of products in a merchant catalog |
US20140172652A1 (en) * | 2012-12-19 | 2014-06-19 | Yahoo! Inc. | Automated categorization of products in a merchant catalog |
US20160055145A1 (en) * | 2014-08-19 | 2016-02-25 | Sandeep Chauhan | Essay manager and automated plagiarism detector |
US11140115B1 (en) * | 2014-12-09 | 2021-10-05 | Google Llc | Systems and methods of applying semantic features for machine learning of message categories |
US20220132048A1 (en) * | 2020-10-26 | 2022-04-28 | Genetec Inc. | Systems and methods for producing a privacy-protected video clip |
US11653052B2 (en) * | 2020-10-26 | 2023-05-16 | Genetec Inc. | Systems and methods for producing a privacy-protected video clip |
US20230209115A1 (en) * | 2021-12-28 | 2023-06-29 | The Adt Security Corporation | Video rights management for an in-cabin monitoring system |
US11729445B2 (en) * | 2021-12-28 | 2023-08-15 | The Adt Security Corporation | Video rights management for an in-cabin monitoring system |
US11831936B2 (en) * | 2021-12-28 | 2023-11-28 | The Adt Security Corporation | Video rights management for an in-cabin monitoring system |
Also Published As
Publication number | Publication date |
---|---|
WO2004068288A2 (en) | 2004-08-12 |
WO2004068288A3 (en) | 2005-04-28 |
US7089241B1 (en) | 2006-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7089241B1 (en) | Classifier tuning based on data similarities | |
US7577709B1 (en) | Reliability measure for a classifier | |
US8504627B2 (en) | Group based spam classification | |
US7725475B1 (en) | Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems | |
US8799387B2 (en) | Online adaptive filtering of messages | |
US7984029B2 (en) | Reliability of duplicate document detection algorithms | |
US20040083270A1 (en) | Method and system for identifying junk e-mail | |
US7222157B1 (en) | Identification and filtration of digital communications | |
EP1564670B1 (en) | Intelligent quarantining for spam prevention | |
Firte et al. | Spam detection filter using KNN algorithm and resampling | |
US7930351B2 (en) | Identifying undesired email messages having attachments | |
US7949718B2 (en) | Phonetic filtering of undesired email messages | |
US20060265498A1 (en) | Detection and prevention of spam | |
US20040003283A1 (en) | Spam detector with challenges | |
US20050283519A1 (en) | Methods and systems for combating spam | |
US7624274B1 (en) | Decreasing the fragility of duplicate document detecting algorithms | |
Lazzari et al. | Cafe-collaborative agents for filtering e-mails | |
KR20050078311A (en) | Method and system for detecting and managing spam mails for multiple mail servers | |
Al Abid | Designing Spam Filtering by Analyzing User and Email Behaviour |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: AMERICA ONLINE, INC., VIRGINIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ALSPECTOR, JOSHUA;KOLCZ, ALEKSANDER;CHOWDHURY, ABDUR R.;REEL/FRAME:018265/0878;SIGNING DATES FROM 20040329 TO 20040415 |
|
AS | Assignment |
Owner name: BANK OF AMERICAN, N.A. AS COLLATERAL AGENT, TEXAS Free format text: SECURITY AGREEMENT;ASSIGNORS:AOL INC.;AOL ADVERTISING INC.;BEBO, INC.;AND OTHERS;REEL/FRAME:023649/0061 Effective date: 20091209 |
|
AS | Assignment |
Owner name: AOL LLC, VIRGINIA Free format text: CHANGE OF NAME;ASSIGNOR:AMERICA ONLINE, INC.;REEL/FRAME:023723/0585 Effective date: 20060403 Owner name: AOL INC., VIRGINIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AOL LLC;REEL/FRAME:023723/0645 Effective date: 20091204 |
|
AS | Assignment |
Owner name: AOL ADVERTISING INC, NEW YORK Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: TRUVEO, INC, CALIFORNIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: YEDDA, INC, VIRGINIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: LIGHTNINGCAST LLC, NEW YORK Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: MAPQUEST, INC, COLORADO Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: TACODA LLC, NEW YORK Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: SPHERE SOURCE, INC, VIRGINIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: QUIGO TECHNOLOGIES LLC, NEW YORK Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: AOL INC, VIRGINIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: NETSCAPE COMMUNICATIONS CORPORATION, VIRGINIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 Owner name: GOING INC, MASSACHUSETTS Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS;ASSIGNOR:BANK OF AMERICA, N A;REEL/FRAME:025323/0416 Effective date: 20100930 |
|
AS | Assignment |
Owner name: MARATHON SOLUTIONS LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AOL INC.;REEL/FRAME:028911/0969 Effective date: 20120614 |
|
AS | Assignment |
Owner name: BRIGHT SUN TECHNOLOGIES, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MARATHON SOLUTIONS LLC;REEL/FRAME:030091/0483 Effective date: 20130312 |
|
XAS | Not any more in us assignment database |
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MARATHON SOLUTIONS LLC;REEL/FRAME:030091/0483 |
|
AS | Assignment |
Owner name: BRIGHT SUN TECHNOLOGIES, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MARATHON SOLUTIONS LLC;REEL/FRAME:031900/0494 Effective date: 20130312 |
|
AS | Assignment |
Owner name: GOOGLE INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BRIGHT SUN TECHNOLOGIES;REEL/FRAME:033074/0009 Effective date: 20140128 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: GOOGLE LLC, CALIFORNIA Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044142/0357 Effective date: 20170929 |