WO2001061563A1 - Hash file system and method for use in a commonality factoring system - Google Patents

Hash file system and method for use in a commonality factoring system

Info

Publication number
WO2001061563A1
WO2001061563A1 (PCT/US2001/004763)
Authority
WO
WIPO (PCT)
Prior art keywords
computer
list
digital
probabilistically unique
digital sequence
Prior art date
Application number
PCT/US2001/004763
Other languages
English (en)
Inventor
Gregory Hagan Moulton
Stephen B. Whitehill
Original Assignee
Avamar Technologies, Inc.
Priority date
Filing date
Publication date
Priority claimed from US09/777,150 external-priority patent/US6704730B2/en
Application filed by Avamar Technologies, Inc. filed Critical Avamar Technologies, Inc.
Priority to AU3826901A priority Critical patent/AU3826901A/xx
Priority to AU2001238269A priority patent/AU2001238269B2/en
Priority to EP01910686A priority patent/EP1269350A4/fr
Priority to JP2001560878A priority patent/JP4846156B2/ja
Priority to CA002399555A priority patent/CA2399555A1/fr
Publication of WO2001061563A1 publication Critical patent/WO2001061563A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1402 - Saving, restoring, recovering or retrying
    • G06F11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F11/1448 - Management of the data involved in backup or backup restore
    • G06F11/1453 - Management of the data involved in backup or backup restore using de-duplication of the data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/13 - File access structures, e.g. distributed indices
    • G06F16/137 - Hash-based
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9014 - Indexing; Data structures therefor; Storage structures; hash tables

Definitions

  • the present invention relates, in general, to the field of hash file systems and commonality factoring systems. More particularly, the present invention relates to a system and method for determining a correspondence between electronic files in a distributed computer data environment and particular applications therefor.
  • the quantity of data that must be managed increases exponentially.
  • operating system and application software becomes larger.
  • the desire to access larger data sets such as multimedia files and large databases further increases the quantity of data that is managed.
  • This increasingly large data load must be transported between computing devices and stored in an accessible fashion.
  • the exponential growth rate of data is expected to outpace the improvements in communication bandwidth and storage capacity, making data management using conventional methods increasingly difficult. Many factors must be balanced and often compromised in conventional data storage systems. Because the quantity of data is extremely large, there is continuing pressure to reduce the cost per bit of storage. Also, data management systems should be scaleable to contemplate not only current needs, but future needs as well.
  • storage systems are incrementally scaleable so that a user can purchase only the capacity needed at any particular time.
  • High reliability and high availability are also considered as data users are increasingly intolerant of lost, damaged, and unavailable data.
  • conventional data management architectures must compromise these factors so that no one architecture provides a cost-effective, reliable, high availability, scaleable solution.
  • RAID: Redundant Array of Independent Disks
  • I/O: input/output
  • MTBF: mean time between failure
  • RAID systems are difficult to scale because of physical limitations in the cabling and controllers.
  • the availability of RAID systems is highly dependent on the functionality of the controllers themselves so that when a controller fails, the data stored behind the controller becomes unavailable.
  • RAID systems require specialized, rather than commodity hardware, and so tend to be expensive solutions.
  • NAS: network-attached storage
  • NAS may provide transparent I/O operations using either hardware or software based RAID.
  • NAS may also automate mirroring of data to one or more other NAS devices to further improve fault tolerance. Because NAS devices can be added to a network, they enable scaling of the total capacity of the storage available to a network. However, NAS devices are constrained in RAID applications to the abilities of the conventional RAID controllers. Also, NAS systems do not enable mirroring and parity across nodes, and so are a limited solution.
  • a system and method for a computer file system that is based and organized upon hashes and/or strings of digits of certain, different, or changing lengths and which is capable of eliminating or screening redundant copies of the blocks of data (or parts of data blocks) from the system.
  • hashes may be produced by a checksum generating program, engine or algorithm such as industry standard Message Digest 4 ("MD4"), MD5, Secure Hash Algorithm (“SHA”) or SHA-1 algorithms.
  • hashes may be generated by a checksum program, engine, algorithm or other means that generates a probabilistically unique hash value for a block of data of indeterminate size based upon a non-linear probabilistic mathematical algorithm or any industry standard technique for generating pseudo-random values from an input text or other data/numeric sequence.
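  • By way of illustration only, a minimal sketch of such a probabilistically unique value using a standard SHA-1 implementation (the function name block_hash is an arbitrary choice, not taken from the patent):

        import hashlib

        def block_hash(data: bytes) -> str:
            # 160 bit SHA-1 digest of a data block, rendered as 40 hex digits.
            return hashlib.sha1(data).hexdigest()

        # Identical blocks always yield the identical value, so a second copy
        # can be recognized by its hash alone, without comparing the data itself.
        assert block_hash(b"some block of data") == block_hash(b"some block of data")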
  • the system and method of the present invention may be utilized, in a particular application disclosed herein, to automatically factor out redundancies in data, often allowing potentially very large quantities of unfactored storage to be reduced in size by several orders of magnitude.
  • the system and method of the present invention would allow all computers, regardless of their particular hardware or software characteristics, to share data simply, efficiently and securely and to provide a uniquely advantageous means for effectuating the reading, writing or referencing of data.
  • the system and method of the present invention is especially efficacious with respect to networked computers or computer systems but may also be applied to isolated data storage with comparable results.
  • the hash file system of the present invention advantageously solves a number of problems that plague conventional storage architectures.
  • the system and method of the present invention eliminates the need for managing a huge collection of directories and files, together with all the wasted system resources that inevitably occur with duplicates, and slightly different copies.
  • the maintenance and storage of duplicate files plagues traditional corporate and private computer systems and generally requires painstaking human involvement to "clean up disk space".
  • the hash file system of the present invention effectively eliminates this problem by eliminating the disk space used for copies and nearly entirely eliminating the disk space used in partial copies. For example, in a traditional computer system copying a gigabyte directory structure to a new location would require another gigabyte of storage.
  • the hash file system of the present invention can reduce the disk space used in this operation by a factor of a hundred thousand or more.
  • the hash file system of the present invention is designed to factor storage on a scale never previously attempted and in a first implementation, is capable of factoring 2 million petabytes of storage, with the ability to expand to much larger sizes. Existing file systems are incapable of managing data on such scales.
  • the hash file system of the present invention may be utilized to provide inexpensive, global computer system data protection and backup. Its factoring function operates very efficiently on typical backup data sets because computer file systems rarely change more than a few percent of their overall storage between each backup operation. Further, the hash file system of the present invention can serve as the basis for an efficient messaging (e-mail) system.
  • E-mail systems are fundamentally data copying mechanisms wherein an author writes a message and sends it to a list of recipients. An e-mail system implements this "sending" operation effectively by copying the data from one place to another. The author generally keeps copies of the messages he sends and the recipients each keep their own copies. These copies are often, in turn, attached in replies that are also kept (i.e. copies of copies).
  • the commonality factoring feature of the present invention can eliminate this gross inefficiency while transparently allowing e-mail users to retain this familiar copy-oriented paradigm. Because, as previously noted, most data in computer systems rarely change, the hash file system of the present invention allows for the reconstruction of complete snapshots of entire systems which can be kept, for example, for every hour of every day they exist or even continuously, with snapshots taken at even minute (or less) intervals depending on the system needs. Further, since conventional computer systems often provide limited versioning of files (i.e. Digital Equipment Corporation's VAX® VMS® file system), the hash file system of the present invention also provides significant advantages in this regard. Versioning in conventional systems presents both good and bad aspects.
  • the hash file system of the present invention provides versioning of files with little overhead through the factoring of identical copies or edited copies with little extra space. For example, saving one hundred revisions of a typical document typically requires about one hundred times the space of the original file. Using the hash file system disclosed herein, those revisions might require only three times the space of the original (depending on the document's size, the degree and type of editing, and external factors).
  • the hash file system of the present invention can be used to efficiently distribute web content because the method of factoring commonality (hashing) also produces uniform distribution over all hash file system servers. This even distribution permits a large array of servers to function as a gigantic web server farm with an evenly distributed load.
  • the hash file system of the present invention can be used as a network accelerator inasmuch as it can be used to reduce network traffic by sending proxies (hashes) for data instead of the data itself. A large percentage of current network traffic is redundant data moving between locations. Sending proxies for the data would allow effective local caching mechanisms to operate, possibly reducing the traffic on the Internet by several orders of magnitude.
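  • A hypothetical sketch of that proxy-first exchange; the helper names remote_has and transmit stand in for real network calls and are illustrative assumptions only:

        import hashlib

        def send_with_proxy(data: bytes, remote_has, transmit):
            # Offer the 160 bit hash first; ship the bytes only if the far side lacks them.
            digest = hashlib.sha1(data).hexdigest()
            if remote_has(digest):
                transmit(digest)            # proxy only: tens of bytes instead of the data
            else:
                transmit((digest, data))    # first transfer of this content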
  • the hash file system and method of the present invention may be implemented using 160 bit hashsums as universal pointers. This differs from conventional file systems which use pointers assigned from a central authority (i.e. in Unix a 32 bit "inode" is assigned by the kernel's file systems in a lock-step operation to assure uniqueness).
  • these 160 bit hashsums are assigned without a central authority (i.e. without locking, without synchronization) by a hashing algorithm.
  • hashing algorithms produce probabilistically unique numbers that uniformly span a range of values. In the case of the hash function SHA-1, that range is between 0 and approximately 10^48 (2^160 possible values). This hashing operation is done by examining only the contents of the data being stored and, therefore, can be done in complete isolation, asynchronously, and without interlocking.
  • Hashing is an operation that can be verified by any component of the system, eliminating the need for trusted operations across those components.
  • the hash file system and method of the present invention disclosed herein is, therefore, functional to eliminate the critical bottleneck of conventional large scale distributed file systems, that is, a trusted encompassing central authority. It permits the construction of a large scale distributed file system with no limits on simultaneous read/write operations, that can operate without risk of incoherence and without the limitation of certain conventional bottlenecks.
  • Fig. 1 is a high level illustration of a representative networked computer environment in which the system and method of the present invention may be implemented
  • Fig. 2 is a more detailed conceptual representation of a possible operating environment for utilization of the system and method of the present invention wherein files maintained on any number of computers or data centers may be stored in a decentralized computer system through an Internet connection to a number of Redundant Arrays of Independent Nodes (“RAIN”) racks located, for example, at geographically diverse locations;
  • Fig. 3 is a logic flow chart depicting the steps in the entry of a computer file into the hash file system of the present invention wherein the hash value for the file is checked against hash values for files previously maintained in a set, or database;
  • Fig. 4 is a further logic flow chart depicting the steps in the breakup of a file or other data sequence into hashed pieces resulting in the production of a number of data pieces as well as corresponding probabilistically unique hash values for each piece;
  • Fig. 5 is another logic flow chart depicting the comparison of the hash values for each piece of a file to existing hash values in the set (or database), the production of records showing the equivalence of a single hash value for all file pieces with the hash values of the various pieces and whereupon new data pieces and corresponding new hash values are added to the set;
  • Fig. 6 is yet another logic flow chart illustrating the steps in the comparison of file hash or directory list hash values to existing directory list hash values and the addition of new file or directory list hash values to the set directory list;
  • Fig. 7 is a comparison of the pieces of a representative computer file with their corresponding hash values both before and after editing of a particular piece of the exemplary file;
  • Fig. 8 is a conceptual representation of the fact that composite data which may be derived by means of the system and method of the present invention is effectively the same as the data represented explicitly but may instead be created by a "recipe" such as the concatenation of data represented by its corresponding hashes or the result of a function using the data represented by the hashes;
  • Fig. 9 is another conceptual representation of how the hash file system and method of the present invention may be utilized to organize data to optimize the reutilization of redundant sequences through the use of hash values as pointers to the data they represent and wherein data may be represented either as explicit byte sequences (atomic data) or as groups of sequences (composites);
  • Fig. 10 is a simplified diagram illustrative of a hash file system address translation function for an exemplary 160 bit hash value
  • Fig. 11 is a simplified exemplary illustration of an index stripe splitting function for use with the system and method of the present invention.
  • Fig. 12 is a simplified illustration of the overall functionality of the system and method of the present invention for use in the backup of data for a representative home computer having a number of program and document files on Day 1 and wherein one of the document files is edited on Day 2 together with the addition of a third document file;
  • Fig. 13 illustrates the comparison of various pieces of a particular document file marked by a number of "sticky bytes" both before and following editing wherein one of the pieces is thereby changed while other pieces remain the same.
  • In a particular implementation of the hash file system and method of the present invention as disclosed herein, its application is directed toward a high availability, high reliability data storage system that leverages rapid advances in commodity computing devices and the robust nature of internetwork technology such as the Internet.
  • a hash file system that manages the correspondence of one or more block(s) of data (including but not limited to files, directories, drive images, software applications, digitized voice, and rich media content) together with one or more symbol(s) for that block of data, wherein the symbol may be a number, hash, checksum, binary sequence, or other identifier that is derived from the block of data itself and is statistically, probabilistically, or otherwise effectively unique to that block of data.
  • the system itself works on any computer system including, without limitation: personal computers; supercomputers; distributed or non-distributed networks; storage area networks ("SAN”) using IDE, SCSI or other disk buses; network attached storage (“NAS”) or other systems capable of storing and/or processing data.
  • the symbol(s) may be derived using one or more hash or checksum generating engines, programs, or algorithms, including but not limited to MD4, MD5, SHA, SHA-1, or their derivatives. Further, the symbol(s) may comprise parts of variable or invariable length symbols derived using a hash or checksum generating engine, program, or algorithm, including but not limited to MD4, MD5, SHA, SHA-1, or other methods of generating probabilistically unique identifiers based on data content.
  • file seeks, or lookups for retrieving data or checking on the existence/availability of data may be accelerated by looking at all or a smaller portion of the symbol, with the symbol portion indicating or otherwise providing the routing information for finding, retrieving, or checking on the existence/availability of the data.
  • a system and method for a hash file system wherein the symbols allow for the identification of redundant copies within the system and/or allow for the identification of copies within the system redundant with data presented to the system for filing and storage.
  • the symbols allow for the elimination of, or allow for the screening of, redundant copies of the data and/or parts of the data in the system or in data and/or parts of data presented to the system, without loss of data integrity and can provide for the even distribution of data over available storage for the system.
  • the system and method of the present invention as disclosed herein requires no central operating point and balances processing and/or input/output ("I/O") load across all computers, supercomputers, or other devices capable of storing and/or processing data attached to the system.
  • the screening of redundant copies of the data and/or parts of the data allows for the creation, repetitive creation, or retention of intelligent boundaries for screening other data in the system, future data presented to the system, or future data stored by the system.
  • the present invention is illustrated and described in terms of a distributed computing environment such as an enterprise computing system using public communication channels such as the Internet.
  • an important feature of the present invention is that it is readily scaled upwardly and downwardly to meet the needs of a particular application. Accordingly, unless specified to the contrary the present invention is applicable to significantly larger, more complex network environments as well as small network environments such as conventional LAN systems.
  • an exemplary internetwork environment 10 may include the Internet which comprises a global internetwork formed by logical and physical connection between multiple wide area networks (“WANs") 14 and local area networks (“LANs”) 16.
  • An Internet backbone 12 represents the main lines and routers that carry the bulk of the data traffic.
  • the backbone 12 is formed by the largest networks in the system that are operated by major Internet service providers ("ISPs") such as GTE, MCI, Sprint, UUNet, and America Online, for example.
  • a "network” comprises a system of general purpose, usually switched physical connections that enable logical connections between processes operating on nodes 18.
  • the physical connections implemented by a network are typically independent of the logical connections that are established between processes using the network. In this manner, a heterogeneous set of processes ranging from file transfer, mail transfer, and the like can use the same physical network. Conversely, the network can be formed from a heterogeneous set of physical network technologies that are invisible to the logically connected processes using the network. Because the logical connection between processes implemented by a network is independent of the physical connection, internetworks are readily scaled to a virtually unlimited number of nodes over long distances.
  • storage devices may be placed at nodes 18.
  • the storage at any node 18 may comprise a single hard drive, or may comprise a managed storage system such as a conventional RAID device having multiple hard drives configured as a single logical volume.
  • the present invention manages redundancy operations across nodes, as opposed to within nodes, so that the specific configuration of the storage within any given node is less relevant.
  • one or more of the nodes 18 may implement storage allocation management (“SAM”) processes that manage data storage across nodes 18 in a distributed, collaborative fashion.
  • SAM processes preferably operate with little or no centralized control for the system as a whole.
  • SAM processes provide data distribution across nodes 18 and implement recovery in a fault-tolerant fashion across network nodes 18 in a manner similar to paradigms found in RAID storage subsystems.
  • SAM processes operate across nodes rather than within a single node or within a single computer, they allow for greater fault tolerance and greater levels of storage efficiency than conventional RAID systems. For example, SAM processes can recover even where a network node 18, LAN 16, or WAN 14 become unavailable. Moreover, even when a portion of the Internet backbone 12 becomes unavailable through failure or congestion, the SAM processes can recover using data distributed on nodes 18 that remain accessible. In this manner, the present invention leverages the robust nature of internetworks to provide unprecedented availability, reliability, fault tolerance and robustness.
  • Referring to Fig. 2, a more detailed conceptual view of an exemplary network computing environment in which the present invention is implemented is depicted.
  • the internetwork 10 of the preceding figure (or Internet 118 in this figure) enables an interconnected network 100 of a heterogeneous set of computing devices and mechanisms 102 ranging from a supercomputer or data center 104 to a hand-held or pen-based device 114. While such devices have disparate data storage needs, they share an ability to retrieve data via network 100 and operate on that data within their own resources.
  • Disparate computing devices 102 including mainframe computers (e.g., VAX station 106 and IBM AS/400 station 116) as well as personal computer or workstation class devices such as IBM compatible device 108, Macintosh device 110 and laptop computer 112 are readily interconnected via internetwork 10 and network 100. Although not illustrated, mobile and other wireless devices may be coupled to the internetwork 10.
  • Internet-based network 120 comprises a set of logical connections, some of which are made through Internet 118, between a plurality of internal networks 122.
  • Internet-based network 120 is akin to a WAN 14 (Fig. 1) in that it enables logical connections between geographically distant nodes.
  • Internet-based networks 120 may be implemented using the Internet 118 or other public and private WAN technologies including leased lines, Fibre Channel, and the like.
  • internal networks 122 are conceptually akin to LANs 16 (Fig. 1) in that they enable logical connections across a more limited distance than WAN 14.
  • Internal networks 122 may be implemented using various LAN technologies including Ethernet, Fiber Distributed Data Interface ("FDDI”), Token Ring, Appletalk, Fibre Channel, and the like.
  • Each internal network 122 connects one or more redundant arrays of independent nodes (RAIN) elements 124 to implement RAIN nodes 18 (Fig. 1 ).
  • RAIN element 124 comprises a processor, memory, and one or more mass storage devices such as hard disks.
  • RAIN elements 124 also include hard disk controllers that may be conventional IDE or SCSI controllers, or may be managing controllers such as RAID controllers.
  • RAIN elements 124 may be physically dispersed or co-located in one or more racks sharing resources such as cooling and power.
  • Each node 18 (Fig. 1 ) is independent of other nodes 18 in that failure or unavailability of one node 18 does not affect availability of other nodes 18, and data stored on one node 18 may be reconstructed from data stored on other nodes 18.
  • the RAIN elements 124 may comprise computers using commodity components such as Intel- based microprocessors mounted on a motherboard supporting a PCI bus and 256 megabytes of random access memory (“RAM") housed in a conventional AT or ATX case.
  • SCSI or IDE controllers may be implemented on the motherboard and/or by expansion cards connected to the PCI bus. Where the controllers are implemented only on the motherboard, a PCI expansion bus may be optionally used.
  • the motherboard may implement two mastering EIDE channels and a PCI expansion card which is used to implement two additional mastering EIDE channels so that each RAIN element 124 includes up to four or more EIDE hard disks.
  • each hard disk may comprise an 80 gigabyte hard disk for a total storage capacity of 320 gigabytes or more per RAIN element.
  • the hard disk capacity and configuration within RAIN elements 124 can be readily increased or decreased to meet the needs of a particular application.
  • the casing also houses supporting mechanisms such as power supplies and cooling devices (not shown).
  • Each RAIN element 124 executes an operating system.
  • a UNIX or UNIX-variant operating system such as Linux may be used. It is contemplated, however, that other operating systems including DOS, Microsoft Windows, Apple Macintosh OS, OS/2, Microsoft Windows NT and the like may be equivalently substituted with predictable changes in performance.
  • the operating system chosen forms a platform for executing application software and processes, and implements a file system for accessing mass storage via the hard disk controller(s).
  • Various application software and processes can be implemented on each RAIN element 124 to provide network connectivity via a network interface using appropriate network protocols such as user datagram protocol (“UDP”), transmission control protocol (TCP), Internet protocol (IP) and the like.
  • a logic flow chart is shown depicting the steps in the entry of a computer file into the hash file system of the present invention and wherein the hash value for the file is checked against hash values for files previously maintained in a set, or database.
  • the process 200 begins by entry of a computer file data 202 (e.g. "File A”) into the hash file system ("HFS") of the present invention upon which a hash function is performed at step 204.
  • the data 206 representing the hash of File A is then compared to the contents of a set containing hash file values at decision step 208. If the data 206 is already in the set, then the file's hash value is added to a directory list at step 210.
  • the contents of the set 212 comprising hash values and corresponding data is provided in the form of existing hash values 214 for the comparison operation of decision step 208.
  • If the hash value for File A is not currently in the set, the file is broken into hashed pieces (as will be more fully described hereinafter) at step 216.
  • a further logic flow chart depicting the steps in the process 300 for breakup of a digital sequence (e.g. a file or other data sequence) into hashed pieces.
  • This process 300 ultimately results in the production of a number of data pieces as well as corresponding probabilistically unique hash values for each piece.
  • the file data 302 is divided into pieces based on commonality with other pieces in the system or the likelihood of pieces being found to be in common in the future at step 304.
  • the result of the operation of step 304 upon the file data 302 is, in the representative example shown, the production of five file pieces 306 denominated A1 through A5 inclusively.
  • Each of the file pieces 306 is then operated on at step 308 by placing it through individual hash function operations to assign a probabilistically unique number to each of the pieces 306 A1 through A5.
  • the results of the operation at step 308 is that each of the pieces 306 (A1 through A5) has an associated, probabilistically unique hash value 310 (shown as A1 Hash through A5 Hash respectively).
  • the file division process of step 304 is described in greater detail hereinafter in conjunction with the unique "sticky byte" operation also disclosed herein.
  • Referring to Fig. 5, another logic flow chart is shown depicting a comparison process 400 for the hash values 310 of each piece 306 of the file to those of existing hash values 214 maintained in the set 212.
  • the hash values 310 for each piece 306 of the file are compared to existing hash values 214 and new hash values 408 and corresponding new data pieces 406 are added to the set 212.
  • hash values 408 not previously present in the database set 212 are added together with their associated data pieces 406.
  • the process 400 also results in the production of records 404 showing the equivalence of a single hash value for all file pieces with the hash values 310 of the various pieces 306.
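  • By way of illustration only, a minimal sketch of the flow of Figs. 3 through 5, assuming an in-memory dictionary as the set 212 and a simple fixed-size splitter in place of the sticky-byte division described later:

        import hashlib

        store = {}           # set 212: hash value -> data piece or list of piece hashes
        directory_list = []  # accumulated (file name, file hash) entries

        def h(data: bytes) -> str:
            return hashlib.sha1(data).hexdigest()

        def split_into_pieces(data: bytes, size: int = 8192):
            # Placeholder division; the patent divides at "sticky byte" points instead.
            return [data[i:i + size] for i in range(0, len(data), size)]

        def enter_file(name: str, data: bytes) -> str:
            file_hash = h(data)
            if file_hash not in store:                  # Fig. 3: file hash not yet in the set
                piece_hashes = []
                for piece in split_into_pieces(data):   # Fig. 4: break into hashed pieces
                    piece_hash = h(piece)
                    if piece_hash not in store:         # Fig. 5: add only new pieces and hashes
                        store[piece_hash] = piece
                    piece_hashes.append(piece_hash)
                store[file_hash] = piece_hashes         # record 404: file hash equates to these piece hashes
            directory_list.append((name, file_hash))    # Fig. 3: add the file's hash to the directory list
            return file_hash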
  • Referring to Fig. 6, yet another logic flow chart is shown illustrating a process 500 for the comparison of file hash or directory list hash values to existing directory list hash values and the addition of new file or directory list hash values to the database directory list.
  • the process 500 operates on stored data 502 which comprises an accumulated list of file names, file meta-data (e.g. date, time, file length, file type etc.) and the file's hash value for each item in a directory.
  • the hash function is run upon the contents of the directory list.
  • Decision step 506 is operative to determine whether or not the hash value for the directory list is in the set 212 of existing hash values 214.
  • If it is, the process 500 returns to add another file hash or directory list hash to a directory list.
  • If the hash value for the directory list is not already in the database set 212, the hash value and data for the directory list are added to the database set 212 at step 508.
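  • A corresponding sketch of the directory list hash of Fig. 6, reusing the store and h() of the earlier sketch; the serialization shown is an arbitrary assumption, not the patent's format:

        def enter_directory(entries) -> str:
            # entries: list of (file_name, meta_data_string, file_hash) tuples for one directory.
            listing = "\n".join("%s|%s|%s" % entry for entry in entries).encode()
            dir_hash = h(listing)
            if dir_hash not in store:      # step 508: only a directory list not already present is added
                store[dir_hash] = listing
            return dir_hash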
  • a comparison 600 of the pieces 306 of a representative computer file (i.e. "File A”) with their corresponding hash values 310 is shown both before and after editing of a particular piece of the exemplary file.
  • the record 404 contains the hash value of File A as well as the hash values 310 of each of the pieces of the file A1 through A5.
  • a representative edit of the File A may produce a change in the data for piece A2 (now represented by A2-b) of the file pieces 306A along with a corresponding change in the hash value A2-b of the hash values 310A.
  • the edited file piece now produces an updated record 404A which includes the modified hash value of File A and the modified hash value of piece A2-b.
  • a conceptual representation 700 is shown illustrative of the fact that composite data (such as composite data 702 and 704) derived by means of the system and method of the present invention, is effectively the same as the data 706 represented explicitly but is instead created by a "recipe", or formula.
  • this recipe includes the concatenation of data represented by its corresponding hashes 708 or the result of a function using the data represented by the hashes.
  • the data blocks 706 may be variable length quantities as shown and the hash values 708 are derived from their associated data blocks.
  • the hash values 708 are a probabilistically unique identification of the corresponding data pieces but truly unique identifications can be used instead or intermixed therewith.
  • composite data 702, 704 can also reference other composite data many levels deep while the hash values 708 for the composite data can be derived from the value of the data the recipe creates or the hash value of the recipe itself.
  • Referring to Fig. 9, another conceptual representation 800 is shown of how the hash file system and method of the present invention may be utilized to organize data 802 to optimize the reutilization of redundant sequences through the use of hash values 806 as pointers to the data they represent and wherein data 802 may be represented either as explicit byte sequences (atomic data) 808 or as groups of sequences (composites) 804.
  • the representation 800 illustrates the tremendous commonality of recipes and data that gets reused at every level.
  • the basic structure of the hash file system of the present invention is essentially that of a "tree” or "bush” wherein the hash values 806 are used instead of conventional pointers.
  • the hash values 806 are used in the recipes to point to the data or another hash value that could also itself be a recipe.
  • recipes can point to other recipes that point to still other recipes that ultimately point to some specific data that may, itself, point to other recipes that point to even more data, eventually getting down to nothing but data.
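  • A small sketch of that recursion, again against the dictionary store used in the earlier sketch, with atomic data held as bytes and composites held as lists of hashes ("recipes"); the simple concatenation recipe is assumed:

        def resolve(hash_value: str) -> bytes:
            # Rebuild the bytes behind a hash: atomic data is returned directly,
            # a recipe is rebuilt by resolving and concatenating each referenced hash.
            entry = store[hash_value]
            if isinstance(entry, bytes):
                return entry
            return b"".join(resolve(piece_hash) for piece_hash in entry)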
  • a simplified diagram 900 is shown illustrative of a hash file system address translation function for an exemplary 160 bit hash value 902.
  • the hash value 902 includes a data structure comprising a front portion 904 and a back portion 906 as shown and the diagram 900 illustrates a particular O(1) operation that is used for enabling the use of the hash value 902 to go to the location of the particular node in the system that contains the corresponding data.
  • the diagram 900 illustrates how the front portion 904 of the hash value 902 data structure may be used to indicate the hash prefix to stripe identification ("ID") 908 and how that is, in turn, utilized to map the stripe ID to IP address and the ID class to IP address 910.
  • the "S2" indicates stripe 2 of index Node 37 912.
  • the index stripe 912 of Node 37 then indicates stripe 88 of data Node 73 indicated by the reference numeral 914.
  • a portion of the hash value 902 itself may be used to indicate which node in the system contains the relevant data
  • another portion of the hash value 902 may be used to indicate which stripe of data at that particular node and yet another portion of the hash value 902 to indicate where within that stripe the data resides.
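  • An illustrative bit-slicing sketch of that translation; the field widths chosen here (16 bits of node, 8 bits of stripe) are arbitrary assumptions rather than values taken from the patent:

        def route(hash_hex: str, node_bits: int = 16, stripe_bits: int = 8):
            value = int(hash_hex, 16)                                        # 160 bit hash value as an integer
            offset_bits = 160 - node_bits - stripe_bits
            node_id = value >> (stripe_bits + offset_bits)                   # front portion -> index node
            stripe_id = (value >> offset_bits) & ((1 << stripe_bits) - 1)    # next portion -> stripe at that node
            offset = value & ((1 << offset_bits) - 1)                        # remainder -> location within the stripe
            return node_id, stripe_id, offset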
  • Referring to Fig. 11, a simplified exemplary illustration of an index stripe splitting function 1000 is shown for use with the system and method of the present invention.
  • an exemplary function 1000 is shown that may be used to effectively split a stripe 1002 (S2) into two stripes 1004 (S2) and 1006 (S7) should one stripe become too full.
  • the odd entries have been moved to stripe 1006 (S7) while the even ones remain in stripe 1004.
  • This function 1000 is one example of how stripe entries may be handled as the overall system grows in size and complexity.
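  • A sketch of one such split, assuming a stripe is held as a dictionary of hash-to-location entries and that "odd" versus "even" is judged by the low bit of each entry's hash value (an assumption for illustration):

        def split_stripe(stripe: dict):
            stay, moved = {}, {}
            for hash_hex, location in stripe.items():
                if int(hash_hex, 16) & 1:        # odd entries move to the new stripe
                    moved[hash_hex] = location
                else:                            # even entries remain in the original stripe
                    stay[hash_hex] = location
            return stay, moved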
  • Referring to Fig. 12, a simplified illustration 1100 of the overall functionality of the system and method of the present invention is shown for use, for example, in the backup of data for a representative home computer having a number of program and document files 1102A and 1104A on Day 1 and wherein the program files 1102B remain the same on Day 2 while one of the document files 1104B is edited on Day 2 (Y.doc) together with the addition of a third document file (Z.doc).
  • the illustration 1100 shows the details of how a computer file system may be broken into pieces and then listed as a series of recipes on a global data protection network ("gDPN") to reconstruct the original data from the pieces.
  • This very small computer system is shown in the form of a "snapshot" on “Day 1 " and then subsequently on “Day 2".
  • the "program files H5" and "my documents H6" are illustrated by numeral 1 106, with the former being represented by a recipe 1108 wherein a first executable file is represented by a hash value H1 1 1 14 and a second represented by a hash value H2 1 1 12.
  • the document files are represented by hash value H6 1 1 10 with the first document being represented by hash value H3 1 1 18 and the second by hash value H4 1 1 16.
  • H10 indicated by numeral 1 120 shows that the "program files H5" have not changed, but the "my document H 10" have.
  • H10 indicated by numeral 1 122 shows the "X.doc” is still represented by hash value H3 1118 while “Y.doc” is now represented by hash value H8 at number 1 124.
  • New document file “Z.doc” is now represented by hash value H9 at numeral 1126.
  • a comparison 1200 of various pieces of a particular document file marked by a number of "sticky bytes" 1204 is shown both before (Day 1 1202A) and following editing (Day 2 1202B) wherein one of the pieces is thereby changed while other pieces remain the same.
  • file 1202A comprises variable length pieces 1206 (1.1), 1208 (1.2), 1210 (2.1), 1212 (2.2), 1214 (2.3) and 1216 (3.1).
  • pieces 1206, 1208, 1210, 1214 and 1216 remain the same (thus having the same hash values) while piece 1212 has now been edited to produce piece 1212A (thus having a differing hash value).
  • Data sticky bytes are a unique, fully automated way to sub-divide computer files such that common elements may be found on multiple related and unrelated computers without the need for communication between the computers.
  • the means by which data sticky points are found is completely mathematical in nature and performs equally well regardless of the data content of the files.
  • all data objects may be indexed, stored and retrieved using, for example (but not limited to), an industry standard checksum such as: MD4, MD5, SHA, or SHA-1. In operation, if two files have the same checksum, it may be considered to be highly likely that they are the same file.
  • data sticky points may be produced with a standard mathematical distribution and with standard deviations that are a small percentage of the target size.
  • a data sticky point is a statistically infrequent arrangement of n bytes. In this case, an example is given with 32 bits because of its ease of implementation in current microprocessor technology. A rolling hash of 32 bits could be generated for the file "f", where f[i] is the ith byte of the file f and "hash" represents the rolling hash of the file.
  • In the following fragment, "scramble" is a 256 entry array of integers, each 32 bits wide, with the integers typically chosen to uniformly span the range; "sticky_bits" is a variable whose count of one bits corresponds to the number of trailing zeros in the hash; and "t" is the threshold above which a sticky point is output:

        sticky_bits = (hash - 1) ^ hash;
        number_of_bits = count_ones(sticky_bits);
        if (number_of_bits > t)
            output_sticky_point(i);
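  • By way of illustration, a runnable sketch of the same operation follows; the rolling-hash update and the threshold value used here are plausible assumptions rather than values taken from the patent, so the boundary positions produced will differ from any particular implementation:

        import random

        random.seed(0)
        # 256 entry table of 32 bit values chosen to span the range uniformly,
        # playing the role of the scramble[] array above.
        SCRAMBLE = [random.getrandbits(32) for _ in range(256)]

        def sticky_points(data: bytes, threshold: int = 13):
            # Yield offsets whose rolling hash ends in an improbably long run of
            # zero bits; such offsets are the "sticky byte" division points.
            rolling = 0
            for i, byte in enumerate(data):
                rolling = ((rolling >> 1) | SCRAMBLE[byte]) & 0xFFFFFFFF
                sticky_bits = (rolling - 1) ^ rolling     # one bits mark the trailing zeros
                if bin(sticky_bits).count("1") > threshold:
                    yield i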
  • While the hashing function utilized to implement the hash file system of the present invention requires a moderately complex computation, it is well within the capability of present day computer systems. Hashing functions are inherently probabilistic and any hashing function might possibly produce incorrect results when two different data objects happen to have the same hash value. However, the system and method herein disclosed mitigates this problem by using well known and researched hashing functions that reduce the probability of collision down to levels acceptable for reliable use (i.e. one chance in a trillion trillion), far less than the error rates otherwise tolerated in conventional computer hardware operations.
  • While Internet infrastructure encompasses a variety of hardware and software mechanisms, the term primarily refers to routers, router software, and physical links between these routers that function to transport data packets from one network node to another.
  • a "digital sequence" may comprise, without limitation, computer program files, computer applications, data files, network packets, streaming data such as multimedia (including audio and video), telemetry data and any other form of data which can be represented by a digital or numeric sequence.
  • the probabilistically unique identifiers produced by means of the hash file system and method of the present invention may also be used as URLs in network applications.

Abstract

The invention relates to a system for a computer file system that is based and organized upon hashes and/or strings of digits of differing or variable lengths (304) and that is capable of eliminating from the system, or screening, redundant copies of aggregate blocks of data (or parts of data blocks). The hash file system uses hash values (310) for computer files or file pieces (306), which may be produced by a checksum generating program, engine or algorithm such as the industry standard MD4, MD5, SHA or SHA-1 algorithms. In an alternative embodiment, the hash values may be produced (308) by a checksum program, engine, algorithm or other means that generates an effectively unique hash value for a block of data of indeterminate size based upon a mathematical algorithm.
PCT/US2001/004763 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system WO2001061563A1 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU3826901A AU3826901A (en) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system
AU2001238269A AU2001238269B2 (en) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system
EP01910686A EP1269350A4 (fr) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system
JP2001560878A JP4846156B2 (ja) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system
CA002399555A CA2399555A1 (fr) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US18376200P 2000-02-18 2000-02-18
US60/183,762 2000-02-18
US24592000P 2000-11-06 2000-11-06
US60/245,920 2000-11-06
US09/777,150 US6704730B2 (en) 2000-02-18 2001-02-05 Hash file system and method for use in a commonality factoring system
US09/777,150 2001-02-05

Publications (1)

Publication Number Publication Date
WO2001061563A1 true WO2001061563A1 (fr) 2001-08-23

Family

ID=27391739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/004763 WO2001061563A1 (fr) 2000-02-18 2001-02-14 Hash file system and method for use in a commonality factoring system

Country Status (6)

Country Link
EP (1) EP1269350A4 (fr)
JP (1) JP4846156B2 (fr)
KR (1) KR100860821B1 (fr)
AU (2) AU2001238269B2 (fr)
CA (1) CA2399555A1 (fr)
WO (1) WO2001061563A1 (fr)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003062996A1 (fr) 2002-01-17 2003-07-31 Thomson Licensing S.A. Systeme et procede de recherche de donnees dupliquees
EP1607868A1 (fr) * 2003-03-10 2005-12-21 Sharp Kabushiki Kaisha Dispositif de traitement de donnees, programme de traitement de donnees et support d'enregistrement
US7124305B2 (en) 2000-02-18 2006-10-17 Permabit, Inc. Data repository and method for promoting network storage of data
US7293027B2 (en) 2003-02-26 2007-11-06 Burnside Acquisition, Llc Method for protecting history in a file system
EP1866774A1 (fr) * 2005-03-11 2007-12-19 Rocksoft Limited Procede de stockage de donnees avec une moindre redondance au moyen de groupements de donnees
GB2444344A (en) * 2006-12-01 2008-06-04 David Irvine File storage and recovery in a Peer to Peer network
GB2444343A (en) * 2006-12-01 2008-06-04 David Irvine Encryption system for peer-to-peer networks in which data is divided into chunks and self-encryption is applied
EP2013974A2 (fr) * 2006-04-07 2009-01-14 Data Storage Group Techniques de compression et de stockage de donnees
JP2009181590A (ja) * 2001-09-06 2009-08-13 Iron Mountain Inc 選択的データバックアップ
EP2093885A1 (fr) * 2002-10-30 2009-08-26 Riverbed Technology, Inc. Schéma de segmentation à base de contenu pour la compression de données dans le stockage et la transmission
JP2009543198A (ja) * 2006-06-29 2009-12-03 ネットアップ,インコーポレイテッド ブロック指紋を読み出し、ブロック指紋を使用してデータ重複を解消するシステム、及び方法
JP2009543199A (ja) * 2006-06-29 2009-12-03 ネットアップ,インコーポレイテッド 永久的コンシステンシ・ポイント・イメージを使用するストレージシステムのデータ重複解消を管理するシステム、及び方法
EP2164005A3 (fr) * 2008-09-11 2010-09-29 NEC Laboratories America, Inc. Systèmes de stockage adressables de contenu et procédés employant des blocs recherchables
US8151066B2 (en) 2007-03-27 2012-04-03 Hitachi, Ltd. Computer system preventing storage of duplicate files
US8275782B2 (en) 2004-09-15 2012-09-25 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US8327097B2 (en) 2008-02-26 2012-12-04 Kddi Corporation Data backing up for networked storage devices using de-duplication technique
US8423556B2 (en) 2009-07-14 2013-04-16 Fujitsu Limited Archive device
US8650370B2 (en) 2009-07-21 2014-02-11 Fujitsu Limited Data storing method and data storing system
US8725705B2 (en) 2004-09-15 2014-05-13 International Business Machines Corporation Systems and methods for searching of storage data with reduced bandwidth requirements
US8832045B2 (en) 2006-04-07 2014-09-09 Data Storage Group, Inc. Data compression and storage techniques
US9037545B2 (en) 2006-05-05 2015-05-19 Hybir Inc. Group based complete and incremental computer file backup system, process and apparatus
CN106716375A (zh) * 2014-09-30 2017-05-24 微软技术许可有限责任公司 具有每区段校验和的文件系统

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810398B2 (en) * 2000-11-06 2004-10-26 Avamar Technologies, Inc. System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences
US7092976B2 (en) * 2003-06-24 2006-08-15 International Business Machines Corporation Parallel high speed backup for a storage area network (SAN) file system
US7428557B2 (en) * 2004-03-22 2008-09-23 Microsoft Corporation Efficient data transfer to/from storage medium of computing device
JP4568532B2 (ja) * 2004-05-13 2010-10-27 日本無線株式会社 無線装置制御システム、無線装置および制御装置
EP1866776B1 (fr) * 2005-03-11 2015-12-30 Rocksoft Limited Procede permettant de detecter la presence de sous-blocs dans un systeme de stockage a redondance reduite
US7831793B2 (en) * 2006-03-01 2010-11-09 Quantum Corporation Data storage system including unique block pool manager and applications in tiered storage
US8490078B2 (en) * 2007-09-25 2013-07-16 Barclays Capital, Inc. System and method for application management
US8447938B2 (en) * 2008-01-04 2013-05-21 International Business Machines Corporation Backing up a deduplicated filesystem to disjoint media
JP5248912B2 (ja) * 2008-05-12 2013-07-31 株式会社日立製作所 サーバ計算機、計算機システムおよびファイル管理方法
US7992037B2 (en) * 2008-09-11 2011-08-02 Nec Laboratories America, Inc. Scalable secondary storage systems and methods
JP5313600B2 (ja) * 2008-09-16 2013-10-09 株式会社日立製作所 ストレージシステム、及びストレージシステムの運用方法
KR101046025B1 (ko) * 2008-10-07 2011-07-01 이경수 임베디드 시스템 데이터 추출 장치 및 그 방법
JP5444728B2 (ja) * 2009-01-26 2014-03-19 日本電気株式会社 ストレージシステム、ストレージシステムにおけるデータ書込方法及びデータ書込プログラム
JP5463746B2 (ja) * 2009-06-15 2014-04-09 日本電気株式会社 アーカイブストレージ装置、ストレージシステム、データ格納方法、およびデータ格納プログラム
CN102792281B (zh) * 2010-03-04 2015-11-25 日本电气株式会社 存储设备
US20110314070A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Optimization of storage and transmission of data
JP5650982B2 (ja) 2010-10-25 2015-01-07 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation ファイルの重複を排除する装置及び方法
US9195666B2 (en) * 2012-01-17 2015-11-24 Apple Inc. Location independent files
CN103873503A (zh) * 2012-12-12 2014-06-18 鸿富锦精密工业(深圳)有限公司 数据块备份系统及方法
KR101613146B1 (ko) * 2015-03-24 2016-04-18 주식회사 티맥스데이터 데이터베이스 암호화 방법
CN109639807A (zh) * 2018-12-19 2019-04-16 中国四维测绘技术有限公司 一种基于slice切片的大数据量遥感影像文件网络传输方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5016009A (en) * 1989-01-13 1991-05-14 Stac, Inc. Data compression apparatus and method
US5126739A (en) * 1989-01-13 1992-06-30 Stac Electronics Data compression apparatus and method
US5140321A (en) * 1991-09-04 1992-08-18 Prime Computer, Inc. Data compression/decompression method and apparatus
US5406279A (en) * 1992-09-02 1995-04-11 Cirrus Logic, Inc. General purpose, hash-based technique for single-pass lossless data compression
US5754844A (en) * 1995-12-14 1998-05-19 Sun Microsystems, Inc. Method and system for accessing chunks of data using matching of an access tab and hashing code to generate a suggested storage location
US5831558A (en) * 1996-06-17 1998-11-03 Digital Equipment Corporation Method of compressing and decompressing data in a computer system by encoding data using a data dictionary
US5850565A (en) * 1996-08-26 1998-12-15 Novell, Inc. Data compression method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5202982A (en) * 1990-03-27 1993-04-13 Sun Microsystems, Inc. Method and apparatus for the naming of database component files to avoid duplication of files
JPH06103127A (ja) * 1992-09-22 1994-04-15 Kanebo Ltd ハッシュファイルデータ管理装置およびハッシュファイルデータ管理方法
US5990810A (en) * 1995-02-17 1999-11-23 Williams; Ross Neil Method for partitioning a block of data into subblocks and for storing and communcating such subblocks
EP2270687A2 (fr) * 1995-04-11 2011-01-05 Kinetech, Inc. Identification de données dans un système de traitement de données
US6021491A (en) * 1996-11-27 2000-02-01 Sun Microsystems, Inc. Digital signatures for data streams and data archives
US6098079A (en) * 1998-04-02 2000-08-01 Mitsubishi Electric Information Technology Center America, Inc. (Ita) File version reconciliation using hash codes
US7062648B2 (en) * 2000-02-18 2006-06-13 Avamar Technologies, Inc. System and method for redundant array network storage

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5016009A (en) * 1989-01-13 1991-05-14 Stac, Inc. Data compression apparatus and method
US5126739A (en) * 1989-01-13 1992-06-30 Stac Electronics Data compression apparatus and method
US5140321A (en) * 1991-09-04 1992-08-18 Prime Computer, Inc. Data compression/decompression method and apparatus
US5281967A (en) * 1991-09-04 1994-01-25 Jung Robert K Data compression/decompression method and apparatus
US5406279A (en) * 1992-09-02 1995-04-11 Cirrus Logic, Inc. General purpose, hash-based technique for single-pass lossless data compression
US5754844A (en) * 1995-12-14 1998-05-19 Sun Microsystems, Inc. Method and system for accessing chunks of data using matching of an access tab and hashing code to generate a suggested storage location
US5831558A (en) * 1996-06-17 1998-11-03 Digital Equipment Corporation Method of compressing and decompressing data in a computer system by encoding data using a data dictionary
US5850565A (en) * 1996-08-26 1998-12-15 Novell, Inc. Data compression method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1269350A4 *

Cited By (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7587617B2 (en) 2000-02-18 2009-09-08 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US7398283B2 (en) 2000-02-18 2008-07-08 Burnside Acquisition, Llc Method for providing access control for data items in a data repository in which storage space used by identical content is shared
US7693814B2 (en) 2000-02-18 2010-04-06 Permabit Technology Corporation Data repository and method for promoting network storage of data
US7124305B2 (en) 2000-02-18 2006-10-17 Permabit, Inc. Data repository and method for promoting network storage of data
US7356701B2 (en) 2000-02-18 2008-04-08 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US7287030B2 (en) 2000-02-18 2007-10-23 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US7412462B2 (en) 2000-02-18 2008-08-12 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US7685096B2 (en) 2000-02-18 2010-03-23 Permabit Technology Corporation Data repository and method for promoting network storage of data
US9177175B2 (en) 2000-02-18 2015-11-03 Permabit Technology Corporation Data repository and method for promoting network storage of data
US7506173B2 (en) 2000-02-18 2009-03-17 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US7657931B2 (en) 2000-02-18 2010-02-02 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
JP2009181590A (ja) * 2001-09-06 2009-08-13 Iron Mountain Inc 選択的データバックアップ
KR100959306B1 (ko) * 2002-01-17 2010-05-26 톰슨 라이센싱 중복 데이터 탐색 시스템 및 방법
EP1466251A4 (fr) * 2002-01-17 2007-04-25 Thomson Licensing Systeme et procede de recherche de donnees dupliquees
EP1466251A1 (fr) * 2002-01-17 2004-10-13 Thomson Licensing S.A. Systeme et procede de recherche de donnees dupliquees
WO2003062996A1 (fr) 2002-01-17 2003-07-31 Thomson Licensing S.A. Systeme et procede de recherche de donnees dupliquees
EP2093885A1 (fr) * 2002-10-30 2009-08-26 Riverbed Technology, Inc. Schéma de segmentation à base de contenu pour la compression de données dans le stockage et la transmission
US7734595B2 (en) 2003-02-26 2010-06-08 Permabit Technology Corporation Communicating information between clients of a data repository that have deposited identical data items
US8055628B2 (en) 2003-02-26 2011-11-08 Permabit Technology Corporation History preservation in a computer storage system
US7496555B2 (en) 2003-02-26 2009-02-24 Permabit, Inc. History preservation in a computer storage system
US7478096B2 (en) 2003-02-26 2009-01-13 Burnside Acquisition, Llc History preservation in a computer storage system
US9104716B2 (en) 2003-02-26 2015-08-11 Permabit, Inc. History preservation in a computer storage system
US8095516B2 (en) 2003-02-26 2012-01-10 Permabit Technology Corporation History preservation in a computer storage system
US7467144B2 (en) 2003-02-26 2008-12-16 Burnside Acquisition, Llc History preservation in a computer storage system
US7987197B2 (en) 2003-02-26 2011-07-26 Permabit Technology Corporation History preservation in a computer storage system
US7979397B2 (en) 2003-02-26 2011-07-12 Permabit Technology Corporation History preservation in a computer storage system
US7930315B2 (en) 2003-02-26 2011-04-19 Permabit Technology Corporation History preservation in a computer storage system
US7912855B2 (en) 2003-02-26 2011-03-22 Permabit Technology Corporation History preservation in a computer storage system
US7747583B2 (en) 2003-02-26 2010-06-29 Permabit Technology Corporation History preservation in a computer storage system
US7363326B2 (en) 2003-02-26 2008-04-22 Burnside Acquisition, Llc Archive with timestamps and deletion management
US7318072B2 (en) 2003-02-26 2008-01-08 Burnside Acquisition, Llc History preservation in a computer storage system
US7293027B2 (en) 2003-02-26 2007-11-06 Burnside Acquisition, Llc Method for protecting history in a file system
EP1607868A4 (fr) * 2003-03-10 2009-04-22 Sharp Kk Data processing device, data processing program, and recording medium
EP1607868A1 (fr) * 2003-03-10 2005-12-21 Sharp Kabushiki Kaisha Data processing device, data processing program, and recording medium
US8275756B2 (en) 2004-09-15 2012-09-25 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US8275782B2 (en) 2004-09-15 2012-09-25 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US10649854B2 (en) 2004-09-15 2020-05-12 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US10282257B2 (en) 2004-09-15 2019-05-07 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US9430486B2 (en) 2004-09-15 2016-08-30 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US9400796B2 (en) 2004-09-15 2016-07-26 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US9378211B2 (en) 2004-09-15 2016-06-28 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US8725705B2 (en) 2004-09-15 2014-05-13 International Business Machines Corporation Systems and methods for searching of storage data with reduced bandwidth requirements
US8275755B2 (en) 2004-09-15 2012-09-25 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
EP1866774A4 (fr) * 2005-03-11 2010-04-14 Rocksoft Ltd Method for storing data with reduced redundancy using data clusters
EP1866774A1 (fr) * 2005-03-11 2007-12-19 Rocksoft Limited Method for storing data with reduced redundancy using data clusters
EP2013974A4 (fr) * 2006-04-07 2009-08-05 Data Storage Group Data compression and storage techniques
US7860843B2 (en) 2006-04-07 2010-12-28 Data Storage Group, Inc. Data compression and storage techniques
AU2007234696B2 (en) * 2006-04-07 2011-08-18 Data Storage Group Data compression and storage techniques
EP2013974A2 (fr) * 2006-04-07 2009-01-14 Data Storage Group Data compression and storage techniques
US8832045B2 (en) 2006-04-07 2014-09-09 Data Storage Group, Inc. Data compression and storage techniques
US9037545B2 (en) 2006-05-05 2015-05-19 Hybir Inc. Group based complete and incremental computer file backup system, process and apparatus
US10671761B2 (en) 2006-05-05 2020-06-02 Hybir Inc. Group based complete and incremental computer file backup system, process and apparatus
US9679146B2 (en) 2006-05-05 2017-06-13 Hybir Inc. Group based complete and incremental computer file backup system, process and apparatus
JP2009543199A (ja) * 2006-06-29 2009-12-03 NetApp, Inc. System and method for managing data deduplication in a storage system using persistent consistency point images
JP2009543198A (ja) * 2006-06-29 2009-12-03 NetApp, Inc. System and method for reading block fingerprints and eliminating data duplication using block fingerprints
GB2444344A (en) * 2006-12-01 2008-06-04 David Irvine File storage and recovery in a Peer to Peer network
GB2444343B (en) * 2006-12-01 2012-04-18 David Irvine Self encryption
GB2446200A (en) * 2006-12-01 2008-08-06 David Irvine Encryption system for peer-to-peer networks which relies on hash based self-encryption and mapping
GB2444343A (en) * 2006-12-01 2008-06-04 David Irvine Encryption system for peer-to-peer networks in which data is divided into chunks and self-encryption is applied
US8407431B2 (en) 2007-03-27 2013-03-26 Hitachi, Ltd. Computer system preventing storage of duplicate files
US8151066B2 (en) 2007-03-27 2012-04-03 Hitachi, Ltd. Computer system preventing storage of duplicate files
US8327097B2 (en) 2008-02-26 2012-12-04 Kddi Corporation Data backing up for networked storage devices using de-duplication technique
EP2164005A3 (fr) * 2008-09-11 2010-09-29 NEC Laboratories America, Inc. Content addressable storage systems and methods employing searchable blocks
US8423556B2 (en) 2009-07-14 2013-04-16 Fujitsu Limited Archive device
US8650370B2 (en) 2009-07-21 2014-02-11 Fujitsu Limited Data storing method and data storing system
CN106716375A (zh) * 2014-09-30 2017-05-24 Microsoft Technology Licensing, LLC File system with per-extent checksums
CN106716375B (zh) * 2014-09-30 2019-12-03 Microsoft Technology Licensing, LLC File system with per-extent checksums

Also Published As

Publication number Publication date
CA2399555A1 (fr) 2001-08-23
AU3826901A (en) 2001-08-27
KR100860821B1 (ko) 2008-09-30
KR20020082851A (ko) 2002-10-31
JP2003524243A (ja) 2003-08-12
AU2001238269B2 (en) 2006-06-22
EP1269350A1 (fr) 2003-01-02
JP4846156B2 (ja) 2011-12-28
EP1269350A4 (fr) 2006-08-16

Similar Documents

Publication Publication Date Title
US6704730B2 (en) Hash file system and method for use in a commonality factoring system
AU2001238269B2 (en) Hash file system and method for use in a commonality factoring system
AU2001238269A1 (en) Hash file system and method for use in a commonality factoring system
AU2001296665B2 (en) System for identifying common digital sequences
AU2001296665A1 (en) System for identifying common digital sequences
US9967298B2 (en) Appending to files via server-side chunking and manifest manipulation
US7457800B2 (en) Storage system for randomly named blocks of data
US8200788B2 (en) Slice server method and apparatus of dispersed digital storage vaults
EP2147437B1 (fr) Seeding replication
US20070255758A1 (en) System and method for sampling based elimination of duplicate data
EP2758883A1 (fr) Gestion d'asymétrie de tailles d'extensions de données au cours d'une réplication logique dans un système de stockage
Kumar et al. Differential Evolution based bucket indexed data deduplication for big data storage
Phyu et al. Efficient data deduplication scheme for scale-out distributed storage
Bhagwat Doctor of Philosophy dissertation in Computer Science
Dmitry RP*ha: A high-availability scalable distributed data structure based on range partitioning

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2399555

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2001 560878

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 1020027010804

Country of ref document: KR

REEP Request for entry into the european phase

Ref document number: 2001910686

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2001910686

Country of ref document: EP

Ref document number: 2001238269

Country of ref document: AU

WWP Wipo information: published in national office

Ref document number: 1020027010804

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2001910686

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642