CN101046858B - Electronic information comparing system and method and anti-garbage mail system - Google Patents

Electronic information comparing system and method and anti-garbage mail system

Info

Publication number
CN101046858B
CN101046858B CN2006100600947A CN200610060094A CN101046858B CN 101046858 B CN101046858 B CN 101046858B CN 2006100600947 A CN2006100600947 A CN 2006100600947A CN 200610060094 A CN200610060094 A CN 200610060094A CN 101046858 B CN101046858 B CN 101046858B
Authority
CN
China
Prior art keywords
matrix
text
comparison
round values
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2006100600947A
Other languages
Chinese (zh)
Other versions
CN101046858A (en)
Inventor
王晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2006100600947A
Publication of CN101046858A
Application granted
Publication of CN101046858B
Active
Anticipated expiration

Abstract

The present invention discloses an electronic text comparison method comprising the following steps: (a) converting a first electronic text segment and a second electronic text segment into a first matrix and a second matrix, respectively, according to the same conversion rule, the first matrix and the second matrix being of the same size; (b) comparing, in turn, the elements at the same positions of the first matrix and the second matrix, and calculating a similarity coefficient from the comparison result using a defined comparison function; and (c) judging from the similarity coefficient whether the first electronic text segment and the second electronic text segment are similar, the two segments being similar if the similarity coefficient is greater than a specific threshold. The invention also discloses a corresponding electronic text comparison system and an anti-spam system.

Description

Electronic information comparing system and method and anti-garbage mail system
Technical field
The present invention relates to the field of computer technology and, more particularly, to an electronic information comparison system and method and to an anti-spam system.
Background technology
With the growth of the Internet, more and more people use email to communicate with one another. Alongside this, users' mailboxes have filled with large amounts of spam (for example, malicious harassing mail and advertising), which inconveniences users.
A key characteristic of spam is that large numbers of identical messages are sent. Although spammers are gradually changing their sending strategies and altering details such as the content format of the mail they send, the defining traits of high volume and identical content remain. Spam identification therefore depends more and more on quickly recognizing large numbers of messages with similar content. Efficiency is an important consideration for any such technique, especially when it is applied in an anti-spam system on a large-scale mail server.
Mail comparison based on MD5 checksums is currently a widely used anti-spam scheme. It applies a hash operation to a data string of arbitrary length to produce a short value of fixed length. Because two different strings almost never produce the same MD5 value, two strings can be judged identical by comparing their MD5 values.
Although the MD5-based check is fast, it has a fatal shortcoming: if the mail contents are not strictly identical, any change at all produces a different MD5 value and thus affects the result. Because identical MD5 values are the precondition for recognizing mail with identical content, a spammer need only alter the mail content slightly to evade the MD5 check, and doing so is a problem spammers can solve easily.
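The sensitivity described above can be illustrated with a minimal Python sketch. The helper name md5_hex and the two sample strings are hypothetical and chosen only for illustration; hashlib is the Python standard-library hashing module.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of the UTF-8 encoding of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical sample mails: the second differs by one appended character.
original = "Cheap watches, click here now"
tweaked = "Cheap watches, click here now!"

# The digests differ, so an exact-match check treats the mails as unrelated.
print(md5_hex(original) == md5_hex(tweaked))
```

A single appended character is enough to make the two digests unequal, which is why an exact-digest comparison cannot recognize near-duplicate mail.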
Mail similarity can also be judged with general string/text similarity methods. These methods often use edit distance, that is, the minimum number of insertions, deletions, and substitutions needed to convert a source string (s) into a target string (t). Edit distance is widely used in NLP (natural language processing) and is also commonly used to count the changes made to an original text. However, the method usually requires a recursive implementation; although it is fast and effective for short strings, its computational cost is prohibitive for large volumes of mail bodies.
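For reference, edit distance can be sketched as follows. This is not taken from the patent: it swaps the recursive formulation mentioned above for the common iterative dynamic-programming form, and the function name edit_distance is an assumption. The nested loops perform len(s) * len(t) steps, which is the cost that becomes prohibitive for long mail bodies.

```python
def edit_distance(s, t):
    """Levenshtein distance computed with a two-row dynamic-programming table."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

Comparing one new mail against every stored mail this way repeats the quadratic computation once per stored mail.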
In addition, as the volume of electronic information grows, more and more applications need to compare two pieces of electronic information in order to judge their similarity. For example, a search engine needs to merge web pages with similar content to reduce the number of search results, and instant messaging tools or chat rooms block identical content to prevent flooding. In these applications, existing comparison methods generally suffer from low accuracy or low efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide an electronic text comparison method and system that address the low efficiency or low accuracy of the existing electronic information comparison techniques described above.
The present invention also provides a new anti-spam system that addresses the low efficiency and low accuracy of existing anti-spam systems.
The technical solution adopted by the present invention is an electronic text comparison method comprising the following steps:
(a) converting a first electronic text segment and a second electronic text segment into a first matrix and a second matrix, respectively, according to the same conversion rule, the first matrix and the second matrix having the same size;
(b) comparing, in turn, the elements at the same positions of the first matrix and the second matrix, and calculating a similarity coefficient from the comparison result using a specified comparison function;
(c) judging from the similarity coefficient whether the first electronic text segment and the second electronic text segment are similar, the two segments being similar if the similarity coefficient is greater than a specified threshold;
wherein converting an electronic text segment into a matrix in step (a) comprises the following steps:
(a1) decoding the electronic text segment into a character string, and dividing the character string into one or more segments according to the structure of the original electronic text segment;
(a2) converting each string segment into a sequence of integer values, each integer being in the range 0-255;
(a3) for each integer value in turn, taking the number of times each integer value occurs within a specified window after it as an element of the matrix, and performing this operation on every integer value in the sequence to form the matrix.
In the electronic text comparison method of the present invention, the specified window size in step (a3) is 10-20.
In the electronic text comparison method of the present invention, step (b) comprises the following steps:
(b1) traversing in order the elements at the same positions in the first matrix and the second matrix and comparing them, recording the total number of positions whose element values match as M and the total number of positions whose element values do not match as D;
(b2) calculating the similarity coefficient:
$$S = \frac{M/(D+M)}{S_1/S_2}$$
where $S_1$ and $S_2$ are, respectively, the text lengths of the first matrix and the second matrix, or the numbers of elements greater than zero in the first matrix and the second matrix, and $S_1 > S_2$.
In the electronic text comparison method of the present invention, when matches are counted in step (b1), a position is counted as a match if the first element and the second element at that position in the first and second matrices are both non-zero and the ratio of the first element to the second element lies between the reciprocal of a statistical value and the statistical value itself; otherwise the position is counted as a mismatch.
The present invention also provides an electronic text comparison system comprising at least a matrix conversion module and a matrix comparison module. The matrix conversion module converts an electronic text segment into a matrix according to a conversion rule. The matrix comparison module compares the elements at the same positions of two matrices and calculates a similarity coefficient from the comparison result using a specified comparison function. The matrix conversion module includes: a decoding submodule that decodes the electronic text segment into one or more string segments according to the structure of the electronic text segment; an integer-conversion submodule that converts each string segment into a sequence of integer values; and a matrix-forming submodule that, for each integer value in turn, takes the number of times each integer value occurs within a specified window after it as an element of the matrix, performing this operation on every integer value in the sequence to form the matrix.
In the electronic text comparison system of the present invention, the matrix comparison module includes a statistics submodule, which traverses in order the elements at the same positions in the two matrices to be compared, compares them, and records the total number of positions whose element values match as M and the total number whose element values do not match as D, and a calculation submodule, which calculates the similarity coefficient S:
$$S = \frac{M/(D+M)}{S_1/S_2}$$
where $S_1$ and $S_2$ are, respectively, the text lengths of the matrices to be compared, or the numbers of elements greater than zero in the matrices to be compared, and $S_1 > S_2$.
The present invention also provides an anti-spam system comprising, connected in sequence, a matrix conversion control center, a parallel processing unit, a matrix comparison control center, and a spam judgment center. The matrix conversion control center converts large batches of emails into multiple transition matrices in parallel; the parallel processing unit then compares the transition matrices; and the matrix comparison control center collates the statistics into a final comparison conclusion. The spam judgment center judges, from the final comparison conclusion of the matrix comparison control center, whether the email corresponding to the input matrix is spam. When an email is converted into a transition matrix, the electronic text segment of the email is first decoded into a character string, which is divided into one or more segments according to the structure of the original electronic text segment; each string segment is then converted into a sequence of integer values in the range 0-255; finally, for each integer value in turn, the number of times each integer value occurs within a specified window after it is taken as an element of the matrix, and this operation is performed on every integer value in the sequence to form the matrix.
In the anti-spam system of the present invention, the parallel processing unit comprises a plurality of server units, each of which comprises a transition matrix data module and a comparison module. The transition matrix data module provides basic data management and stores one or more transition matrices; the comparison module compares an input transition matrix with all matrices in the transition matrix data module.
The electronic text comparison method and system of the present invention calculate the similarity of electronic texts by comparing their transition matrices. They can identify not only electronic texts whose content is exactly identical but also electronic texts into which some random characters have been inserted. The anti-spam system of the present invention identifies spam by recognizing the similarity of emails, and its identification accuracy is comparatively high.
Description of drawings
The invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a structural diagram of the electronic text comparison system of the present invention;
Fig. 2 is a structural diagram of the matrix conversion module and the matrix comparison module in Fig. 1;
Fig. 3 is a flowchart of the electronic text comparison method of the present invention;
Fig. 4 is a structural diagram of the anti-spam system of the present invention.
Embodiment
As shown in Fig. 1, the electronic text comparison system of the present invention includes a matrix conversion module 11 and a matrix comparison module 12. The matrix conversion module 11 reads in electronic text segments; in this embodiment, an electronic text segment may be text data in various formats, numerical data, and so on.
The matrix conversion module 11 reads in the electronic text and converts it into a transition matrix. In this embodiment, the transition matrix generated by the matrix conversion module 11 has a maximum size of 256x256 and is a sparse matrix. During conversion, the module ignores the encoding format and wide-character attributes of the electronic text and treats every character as 8-bit units, so that each integer value of the electronic text lies in the range 0-255. It then counts how often each character occurs together with the other characters around it (this can be viewed as the frequency with which one character transitions to another, hence the name transition matrix), yielding a transition matrix of at most 256x256. In practice, electronic text of any length can be converted into such a transition matrix, and this regular structure also simplifies storage and computation.
The matrix comparison module 12 compares any two transition matrices generated by the matrix conversion module 11 and thereby determines the similarity of the two corresponding electronic texts.
As shown in Fig. 2, the matrix conversion module 11 includes, connected in sequence, a decoding submodule 111, an integer-conversion submodule 112, and a matrix-forming submodule 113.
The decoding submodule 111 decodes the electronic text segment into one or more string segments according to the structure of the electronic text read in. Binary parts of the electronic text are kept or discarded as required: for example, only the file name is retained for files such as images, Word files are considered for format conversion, and tag (TAG) information is removed from HTML. The text is also divided into blocks according to its organizational structure, and information such as line breaks is retained. In practice, whether binary parts are kept for comparison can be decided according to the specific goal.
The integer-conversion submodule 112 converts each string segment produced by the decoding submodule 111 into a sequence of integer values. The conversion ignores the encoding format and wide-character attributes and treats everything as 8-bit units, so the resulting integer values lie in the range 0-255. For example, "ABCDEFGB" is converted into "65 66 67 68 69 70 71 66".
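The conversion to integer values might be sketched as below. The function name to_int_sequence and the choice of UTF-8 as the byte encoding are assumptions made for illustration; the embodiment only requires that every character be handled as 8-bit units.

```python
def to_int_sequence(segment, encoding="utf-8"):
    """Encode a string segment and return its bytes as integers in 0-255.

    The encoding argument is an illustrative assumption; a wide character
    simply becomes several 8-bit values.
    """
    return list(segment.encode(encoding))


print(to_int_sequence("ABCDEFGB"))  # [65, 66, 67, 68, 69, 70, 71, 66]
```

Under this sketch a multi-byte character contributes one integer per byte, so the value range stays within 0-255 whatever the original script.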
The matrix-forming submodule 113 scans the integer values produced by the integer-conversion submodule 112 from left to right and counts how many times each integer value is followed by each other character within a certain window after it; these counts are the elements of the transition matrix. For example, with a window of 3, only the occurrences of A (65) together with the three characters that immediately follow it, B (66), C (67), and D (68), are counted, usually written A(65) => B(66) C(67) D(68). In this embodiment the window size is generally 10-20. With a window of 10, counting in turn over the example string "ABCDEFGB" yields a matrix like the following:

        ...  B(66)  C(67)  D(68)  E(69)  F(70)  G(71)  ...
A(65)          2      1      1      1      1      1
B(66)          1      1      1      1      1      1
...

In this way, with the integer values as the X/Y axes of the transition matrix and all strings scanned in turn, each electronic text yields a 256x256 transition matrix. Note that most values in this transition matrix are zero (it is a sparse matrix). Any element of the matrix can be written matrix[x][y] = v; the case in which A is followed by two occurrences of B is then expressed as x = 65, y = 66, matrix[65][66] = 2.
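The windowed counting described above can be sketched in Python as follows. The function name transition_matrix and the sparse (x, y)-keyed Counter are illustrative assumptions, not the patent's own data layout.

```python
from collections import Counter


def transition_matrix(values, window=10):
    """For each value x, count how often each value y occurs among the
    `window` values that follow x. Returns a sparse Counter keyed by (x, y)."""
    matrix = Counter()
    for i, x in enumerate(values):
        for y in values[i + 1:i + 1 + window]:
            matrix[(x, y)] += 1
    return matrix


values = list(b"ABCDEFGB")  # [65, 66, 67, 68, 69, 70, 71, 66]
matrix = transition_matrix(values, window=10)
print(matrix[(65, 66)])  # A followed by B within the window: 2
```

Only non-zero cells are stored, which matches the observation that the 256x256 matrix is sparse; a missing key reads as zero.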
The matrix comparison module 12 includes a statistics submodule 121 and a calculation submodule 122. The statistics submodule 121 traverses in order the elements at the same positions in the two matrices to be compared, compares them, and records the total number of positions whose element values match as M and the total number whose element values do not match as D. The calculation submodule 122 uses the statistics from the statistics submodule 121 to calculate the similarity coefficient S:
$$S = \frac{M/(D+M)}{S_1/S_2}$$
where $S_1$ and $S_2$ are, respectively, the text lengths of the matrices to be compared, or the numbers of elements greater than zero in the matrices to be compared, and $S_1 > S_2$.
Fig. 3 shows the flowchart of the electronic text comparison method of the present invention, which comprises the following steps:
Step S31: The first electronic text segment and the second electronic text segment to be compared are first converted into a first matrix and a second matrix, respectively, according to the same conversion rule; the two matrices have the same size. To perform the matrix conversion in this step, each part of the electronic text segment is first decoded into one or more string segments according to its structure. Binary content is kept or discarded as required: for example, only the file name is retained for files such as images, Word files are considered for format conversion, and TAG information is removed from HTML. Each decoded string segment is then converted into a sequence of integer values in the range 0-255, ignoring the encoding format and wide-character attributes and treating everything as 8-bit units. Finally, the integer sequence is scanned from left to right, the number of times each integer value is followed by each other character within a certain window is counted, and the counts are combined into the transition matrix. The window size in this embodiment is 10-20.
Step S32: The elements at the same positions of the first matrix and the second matrix are compared in turn, and the similarity coefficient is calculated from the comparison result using the specified comparison function.
In this step, the elements at the same positions in the first matrix and the second matrix are first traversed in order and compared; the total number of positions whose element values match is recorded as M and the total number whose element values do not match as D, so that the element matching rate is M/(D+M). In this counting, a position is counted as a match when the two elements at that position in the two matrices are both non-zero and their ratio lies between the reciprocal of a statistical value b and b itself; otherwise it is counted as a mismatch. Suppose the first and second matrices are matrix1 and matrix2, and matrix[x][y] denotes the element in row x, column y of a matrix; the counting can then be expressed as:
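/* b sets how strict the comparison is; tune per application, default 5 */
if (matrix1[x][y] > 0 && matrix2[x][y] > 0) {
    /* cast to double so the ratio is not truncated by integer division */
    double R = (double)matrix1[x][y] / matrix2[x][y];
    if (R > 1.0 / b && R < b)
        M++;
    else
        D++;
} else {
    D++;
}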
The value of b expresses how strict the comparison is: the closer b is to 1, the stricter the similarity requirement and the more the comparison tends toward requiring identical original content; larger values relax it. The similarity coefficient is then calculated as
$$S = \frac{M/(D+M)}{S_1/S_2}$$
where $S_1$ and $S_2$ are, respectively, the text lengths of the first matrix and the second matrix, or the numbers of elements greater than zero in the first matrix and the second matrix, and $S_1 > S_2$.
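Steps S31-S32 can be tied together with a short sketch of the counting and of the coefficient. Several points are assumptions rather than statements of the patent: the function name similarity, the sparse dict representation, the skipping of positions that are zero in both matrices (otherwise the mostly empty 256x256 grid would dominate D), and the use of max/min to order S1 and S2, which also admits equal lengths.

```python
def similarity(m1, m2, s1, s2, b=5.0):
    """Similarity coefficient S = (M / (D + M)) / (S1 / S2).

    m1, m2: sparse matrices as dicts mapping (x, y) -> count.
    s1, s2: text lengths (or non-zero element counts); the larger acts as S1.
    """
    matched = mismatched = 0
    for key in set(m1) | set(m2):
        v1, v2 = m1.get(key, 0), m2.get(key, 0)
        if v1 == 0 and v2 == 0:
            continue  # assumption: cells empty in both matrices are ignored
        if v1 > 0 and v2 > 0 and 1.0 / b < v1 / v2 < b:
            matched += 1
        else:
            mismatched += 1
    big, small = max(s1, s2), min(s1, s2)
    if matched + mismatched == 0 or small == 0:
        return 0.0
    return (matched / (matched + mismatched)) / (big / small)


a = {(65, 66): 2, (65, 67): 1}
c = {(65, 66): 2, (65, 68): 1}
print(similarity(a, a, 8, 8))  # identical matrices, equal lengths: 1.0
```

Dividing by S1/S2 lowers the score when the two texts differ greatly in length, so a short text cannot score highly against a much longer one on matching rate alone.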
Step S33: Whether the first electronic text segment and the second electronic text segment are similar is judged from the value of the similarity coefficient S. If S is greater than the specified threshold, the first electronic text segment is similar to the second electronic text segment; otherwise they are dissimilar. In practice, the accuracy of the comparison result is relatively high when S > 0.8. In specific applications the threshold on S can of course be relaxed appropriately. The formula here may take various modified forms, but its essential elements are the M and D values counted in step S32.
The electronic text comparison system and method described above can be applied to the comparison of similar emails in a mail server, of similar web pages in a search engine, of similar messages in instant messaging tools, of similar messages in chat rooms, and so on.
The system and method can be applied in a mail system to identify spam on the basis of mail similarity. The transition matrices of the mail received within a certain period (e.g. 72 hours) are saved as a database, and a similarity threshold for similar mail is set (e.g. 0.75). Suppose the newly received mail is M; the number of mails in the database that are similar to M is counted, and if that number exceeds a threshold (e.g. 50 mails), M is judged to be spam unless it is shown to be non-spam (for example, M appears in a whitelist).
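The decision rule in the preceding paragraph might look like the sketch below. The function name is_spam and its parameters are hypothetical; the similarity coefficients between the new mail and each stored matrix are assumed to have been computed already, and the defaults are the example values from the text.

```python
def is_spam(similarities, sim_threshold=0.75, count_threshold=50,
            whitelisted=False):
    """Flag a new mail as spam when enough stored mails resemble it.

    similarities: similarity coefficients between the new mail and each
    stored transition matrix from the retention period (e.g. 72 hours).
    """
    if whitelisted:
        return False  # a whitelist entry overrides the similarity count
    similar_count = sum(1 for s in similarities if s > sim_threshold)
    return similar_count > count_threshold


print(is_spam([0.9] * 51))  # 51 similar mails exceed the threshold of 50
```

Both thresholds are tuning parameters; raising either makes the filter more conservative.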
Fig. 4 shows an embodiment of an anti-spam system implemented according to the system and method described above. In this embodiment, the anti-spam system includes a matrix conversion control center 43, a parallel processing unit 44, a matrix comparison control center 45, a spam judgment center 46, and so on.
The matrix conversion control center 43 converts large batches of emails into multiple transition matrices in parallel. The mail header can be handled in two ways. In this embodiment, the matrix conversion control center 43 converts only the mail body into a transition matrix. Although the mail header can be counted into the matrix, its structure is quite regular, so the header can instead be handled separately and compared directly without conversion. For mail with a short body, this approach significantly reduces interference from header information and thereby prevents mails from being judged similar merely because their bodies contain little text. There are also two ways to record a transition matrix: if the complete matrix is recorded as plain text, every mail requires 65,536 bytes (256 x 256 elements); alternatively, only the elements with values greater than 0 can be recorded, which saves storage space because most values in the matrix are 0 (it is a sparse matrix).
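The two recording options can be contrasted with a small sketch. The helper names to_sparse and to_dense and the dict-based layout are assumptions for illustration; the patent does not specify a storage format.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a matrix as (x, y) -> value."""
    return {(x, y): v
            for x, row in enumerate(dense)
            for y, v in enumerate(row) if v > 0}


def to_dense(sparse, size=256):
    """Rebuild the full size x size matrix from its non-zero entries."""
    dense = [[0] * size for _ in range(size)]
    for (x, y), v in sparse.items():
        dense[x][y] = v
    return dense


dense = [[0] * 256 for _ in range(256)]
dense[65][66] = 2
dense[66][67] = 1
sparse = to_sparse(dense)
print(len(sparse), "non-zero entries out of", 256 * 256)
```

The round trip through to_dense recovers the original matrix, so the sparse record loses no information.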
The parallel processing unit 44 comprises a plurality of server units, each of which comprises a transition matrix data module and a comparison module. The transition matrix data module provides basic data management and stores one or more transition matrices; the comparison module compares an input transition matrix with all matrices in the transition matrix data module.
The matrix comparison control center 45 collects the comparison results and collates them into a final comparison conclusion, that is, it obtains the element matching rate M/(M+D). The spam judgment center 46 uses the conclusion provided by the matrix comparison control center 45 to make an overall judgment as to whether the email corresponding to the input matrix is spam. Standard techniques such as blacklists and whitelists can be implemented in the spam judgment center 46 in combination with external information. For example, if the mail header was not counted into the transition matrix, the headers of the two mails being compared can be compared field by field to obtain a header similarity reference value, which is then combined with the body comparison result according to some formula to give an overall similarity reference value. Attachments can likewise be compared separately, as in the following two common formulas:
$$S_{all} = k_1 S_H + k_2 S + k_3 S_A$$
$$S_{all} = k \cdot S_H \cdot S \cdot S_A$$
where $k_1$, $k_2$, $k_3$, and $k$ are coefficients that can be set flexibly for different applications, and $S_H$, $S$, and $S_A$ are the similarity coefficients of the mail header, the mail body, and the mail attachments, respectively.
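As a quick numerical illustration of the two combination formulas, the sketch below uses hypothetical function names and hypothetical coefficient values (0.2, 0.6, 0.2); the patent leaves the coefficients to the application.

```python
def combined_weighted(s_h, s, s_a, k1, k2, k3):
    """S_all = k1*S_H + k2*S + k3*S_A (weighted sum)."""
    return k1 * s_h + k2 * s + k3 * s_a


def combined_product(s_h, s, s_a, k=1.0):
    """S_all = k*S_H*S*S_A (product form)."""
    return k * s_h * s * s_a


# Hypothetical scores: header 1.0, body 0.5, attachment 0.0.
print(combined_weighted(1.0, 0.5, 0.0, k1=0.2, k2=0.6, k3=0.2))
print(combined_product(1.0, 0.5, 0.0))
```

The weighted sum tolerates one low component, while the product form drops to zero as soon as any single component is zero.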
The anti-spam system described above can identify not only spam whose content is exactly identical but also spam into which some random characters have been inserted. It can also identify spam sent from legitimate accounts (where the spammer sends through zombie machines controlled by viruses or trojans).
To meet the real-time requirements of a mail system, spam identification can also be accelerated with a distributed architecture. For example, suppose m+1 servers are available (all with the same processing performance). All transition matrices are distributed evenly across m of the servers, each transition matrix being stored on exactly one server; the remaining server is the master. For an incoming unknown mail X, its transition matrix is obtained and sent to the m servers simultaneously, and the number of similar matrices found by each is collected; the master adds the m results to obtain the final number of similar mails. The spam judgment center 46 decides from this number whether the mail is spam. The performance of this scheme thus increases with the number of servers. If redundant storage, disaster recovery, and similar concerns must be considered, the scheme needs only slight modification. By the same reasoning, the conversion of mail into transition matrices can also be made distributed.
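The fan-out scheme can be simulated in a single process as sketched below. Threads stand in for the m servers, and the names shard, count_similar, master_count, and the is_similar callback are all assumptions made for illustration; a real deployment would use network calls instead. The sketch requires m >= 1.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(matrices, m):
    """Spread stored matrices evenly over m servers; each lands on exactly one."""
    return [matrices[i::m] for i in range(m)]


def count_similar(shard_matrices, query, is_similar):
    """Work done by one server: count stored matrices similar to the query."""
    return sum(1 for stored in shard_matrices if is_similar(stored, query))


def master_count(shards, query, is_similar):
    """Master server: send the query to every shard and add up the counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        counts = pool.map(lambda s: count_similar(s, query, is_similar), shards)
        return sum(counts)
```

A usage example would pass a predicate that applies the similarity threshold to a pair of transition matrices; because each stored matrix lives on exactly one shard, the summed count never counts a mail twice.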
In addition, the anti-spam system may also include a standard mail system 41 for conventional functions such as sending and receiving mail; an access control center 42 that implements mail storage, internal access control for the mail in the mail server 41, and so on; and a spam post-processing system 47 that handles all follow-up actions on spam, such as forwarding, discarding, returning to the sender, or marking as spam.
The system and method of the present invention can also be applied in search engines to identify web pages with identical content and display only one of them, thereby reducing duplicate search results. The present invention can also be used to find similar web pages in a search engine, for example the "similar pages" feature of the Google search engine. In addition, the present invention can be applied in instant messaging tools and chat rooms to prevent users from maliciously flooding the screen.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims (8)

1. An electronic text comparison method, characterized by comprising the following steps:
(a) converting a first electronic text segment and a second electronic text segment into a first matrix and a second matrix, respectively, according to the same conversion rule, the first matrix and the second matrix having the same size;
(b) comparing, in turn, the elements at the same positions of the first matrix and the second matrix, and calculating a similarity coefficient from the comparison result using a specified comparison function;
(c) judging from the similarity coefficient whether the first electronic text segment and the second electronic text segment are similar, the two segments being similar if the similarity coefficient is greater than a specified threshold;
wherein converting an electronic text segment into a matrix in step (a) comprises the following steps:
(a1) decoding the electronic text segment into a character string, and dividing the character string into one or more segments according to the structure of the original electronic text segment;
(a2) converting each string segment into a sequence of integer values, each integer being in the range 0-255;
(a3) for each integer value in turn, taking the number of times each integer value occurs within a specified window after it as an element of the matrix, and performing this operation on every integer value in the sequence to form the matrix.
2. The electronic text comparison method according to claim 1, characterized in that the specified window size in step (a3) is 10-20.
3. The electronic text comparison method according to claim 1, characterized in that step (b) comprises the following steps:
(b1) traversing in order the elements at the same positions in the first matrix and the second matrix and comparing them, recording the total number of positions whose element values match as M and the total number of positions whose element values do not match as D;
(b2) calculating the similarity coefficient:
$$S = \frac{M/(D+M)}{S_1/S_2}$$
where $S_1$ and $S_2$ are, respectively, the text lengths of the first matrix and the second matrix, or the numbers of elements greater than zero in the first matrix and the second matrix, and $S_1 > S_2$.
4. The electronic text comparison method according to claim 3, characterized in that, when matches are counted in step (b1), a position is counted as a match if the first element and the second element at that position in the first and second matrices are both non-zero and the ratio of the first element to the second element lies between the reciprocal of a statistical value and the statistical value itself; otherwise the position is counted as a mismatch.
5. An electronic text comparison system, characterized in that it comprises at least a matrix conversion module and a matrix comparison module, wherein the matrix conversion module is used to convert an electronic text segment into a matrix according to a conversion rule, and the matrix comparison module is used to compare the elements at the same positions in two matrices and to calculate a similarity coefficient from the comparison result using a specified comparison function; the matrix conversion module comprises a decoding submodule, which decodes the electronic text segment into one or more character string sections according to the structure of the electronic text segment, an integer conversion submodule, which converts each character string section into a sequence of integer values, and a matrix forming submodule, which, for each integer value in the integer sequence in turn, takes the number of times each integer value occurs within a specified window following that integer value as an element of the matrix, thereby forming the matrix.
6. The electronic text comparison system according to claim 5, characterized in that the matrix comparison module comprises a statistics submodule, which traverses in order the elements at the same positions in the two matrices to be compared, compares them, and records the total number of positions whose element values match as M and the total number of positions whose element values do not match as D, and a calculation submodule, which calculates the similarity coefficient S:
[Formula image FSB00000523530400021, not reproduced in this text]
wherein S1 and S2 are respectively the text lengths of the matrices to be compared, or the numbers of elements greater than zero in the matrices to be compared, and S1>S2.
7. An anti-spam system, characterized in that it comprises a matrix conversion control center, a parallel processing unit, a matrix comparison control center, and a spam judgment center connected in sequence, wherein the matrix conversion control center converts a large batch of emails in parallel into a plurality of conversion matrices, the parallel processing unit compares the conversion matrices, the matrix comparison control center collates the statistics into a final comparison conclusion, and the spam judgment center judges, according to the final comparison conclusion of the matrix comparison control center, whether the email corresponding to the input matrix is spam; when an email is converted into a conversion matrix, the electronic text segment of the email is first decoded into a character string and the character string is divided into one or more sections according to the structure of the original electronic text segment, each section of the character string is then converted into a sequence of integer values, each integer value lying in the range 0-255, and finally, for each integer value in the integer sequence in turn, the number of times each integer value occurs within a specified window following that integer value is taken as an element of the matrix, thereby forming the matrix.
8. The anti-spam system according to claim 7, characterized in that the parallel processing unit comprises a plurality of server units, each server unit comprising a conversion matrix data module and a comparison module, wherein the conversion matrix data module is used for basic data management and stores one or more conversion matrices, and the comparison module is used to compare an input conversion matrix with all of the matrices in the conversion matrix data module.
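The division of work among server units described in claims 7 and 8 can be sketched on a single machine with a thread pool. Everything named here is an assumption made for illustration: `ServerUnit`, `compare_everywhere`, the mail identifiers used as keys, and the `tolerance` parameter are hypothetical, and threads merely stand in for separate servers. The sketch returns the match and mismatch counts (M, D) for each stored matrix and stops there, since the similarity formula and the spam threshold are not reproduced in this text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def count_matches(a: Matrix, b: Matrix, tolerance: float = 2.0) -> Tuple[int, int]:
    """Return (M, D); positions that are zero in both matrices are skipped (assumption)."""
    matched = mismatched = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            if x == 0 and y == 0:
                continue
            if x != 0 and y != 0 and 1.0 / tolerance <= x / y <= tolerance:
                matched += 1
            else:
                mismatched += 1
    return matched, mismatched


class ServerUnit:
    """Stands in for one server unit: a matrix store plus a comparison module."""

    def __init__(self, stored: Dict[str, Matrix]) -> None:
        self.stored = stored  # matrix data module, keyed by a hypothetical mail id

    def compare(self, incoming: Matrix) -> Dict[str, Tuple[int, int]]:
        # Comparison module: compare the input matrix with every stored matrix.
        return {mail_id: count_matches(incoming, m) for mail_id, m in self.stored.items()}


def compare_everywhere(incoming: Matrix, units: List[ServerUnit]) -> Dict[str, Tuple[int, int]]:
    """Fan the input matrix out to all units and merge their partial results."""
    results: Dict[str, Tuple[int, int]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(units))) as pool:
        for partial in pool.map(lambda unit: unit.compare(incoming), units):
            results.update(partial)
    return results


if __name__ == "__main__":
    units = [ServerUnit({"a": [[1, 0], [0, 2]]}), ServerUnit({"b": [[0, 5], [0, 0]]})]
    print(compare_everywhere([[1, 0], [0, 2]], units))
```

In this arrangement each unit only needs its own share of the stored matrices, so the store can grow by adding units while the merging step stays the same.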
CN2006100600947A 2006-03-29 2006-03-29 Electronic information comparing system and method and anti-garbage mail system Active CN101046858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2006100600947A CN101046858B (en) 2006-03-29 2006-03-29 Electronic information comparing system and method and anti-garbage mail system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2006100600947A CN101046858B (en) 2006-03-29 2006-03-29 Electronic information comparing system and method and anti-garbage mail system

Publications (2)

Publication Number Publication Date
CN101046858A CN101046858A (en) 2007-10-03
CN101046858B true CN101046858B (en) 2011-10-05

Family

ID=38771451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2006100600947A Active CN101046858B (en) 2006-03-29 2006-03-29 Electronic information comparing system and method and anti-garbage mail system

Country Status (1)

Country Link
CN (1) CN101046858B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101989884B (en) * 2009-08-05 2013-01-30 启*科技股份有限公司 Method for writing message
CN102193764B (en) * 2010-03-11 2016-04-20 英华达(上海)电子有限公司 Show and process electronic system and the method for multiple document
CN101916255B (en) * 2010-07-02 2012-02-15 互动在线(北京)科技有限公司 HTML (Hypertext Markup Language) content contrast device and method
CN103309851B (en) * 2013-05-10 2016-01-27 微梦创科网络科技(中国)有限公司 The rubbish recognition methods of short text and system
CN105763543B (en) * 2016-02-03 2019-08-30 百度在线网络技术(北京)有限公司 Method and device for identifying a phishing website
CN107294834A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for recognizing spam
CN107480219A (en) * 2017-07-31 2017-12-15 北京微影时代科技有限公司 Information processing method, device, electronic equipment and computer-readable recording medium
CN109753987B (en) * 2018-04-18 2021-08-06 新华三信息安全技术有限公司 File recognition method and feature extraction method
CN112487426A (en) * 2020-11-26 2021-03-12 网宿科技股份有限公司 Method, system and server for determining system white list

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6199103B1 (en) * 1997-06-24 2001-03-06 Omron Corporation Electronic mail determination method and system and storage medium
CN1445722A (en) * 2002-03-14 2003-10-01 精工爱普生株式会社 Method and device for detecting image copy of contents
CN1707492A (en) * 2004-06-05 2005-12-14 腾讯科技(深圳)有限公司 Method for against refuse E-mail

Also Published As

Publication number Publication date
CN101046858A (en) 2007-10-03

Similar Documents

Publication Publication Date Title
CN101046858B (en) Electronic information comparing system and method and anti-garbage mail system
CN101159704A (en) Microcontent similarity based antirubbish method
CN101950284A (en) Chinese word segmentation method and system
CN108228710B (en) Word segmentation method and device for URL
WO2011153894A1 (en) Method and system for distinguishing image spam mail
CN108764835A (en) Reverse talent's pushed information method and apparatus
CN111859070A (en) Mass internet news cleaning system
CN101079826A (en) Email display method and system
CN102045268A (en) Method and device for recovering email data
CN110472046B (en) Government and enterprise service text clustering method
CN103490979A (en) Electronic mail identification method and system
CN116595568B (en) Private data encryption method based on blockchain
CN1185595C (en) Jamproof theme word extracting method
CN112102883B (en) Base sequence coding method and system in FASTQ file compression
CN113887171A (en) Measuring point code standardization automatic conversion method for wind power generation system
Uemura et al. Unsupervised spam detection by document complexity estimation
CN1432944A (en) Method and system for indentifying Chinese address data
CN116501897B (en) Method for constructing knowledge graph based on fuzzy matching
CN117216022B (en) Digital engineering consultation data management system
CN105955982A (en) Method and system for information sequence feature encoding and retrieval
CN116980378B (en) Method and system for marking repeated message of micro-channel group
CN110688457A (en) Steam-massage industry text information input method based on identification analysis
Ageenko et al. Context-based filtering of document images
CN116702899B (en) Entity fusion method suitable for public and private linkage scene
CN111768312B (en) Building energy consumption monitoring method and device based on railway system data coding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant