Dangers of Document Metadata
Document metadata comes in many forms. Below is a list of the types of metadata found in Microsoft Office documents and the risks that each type of metadata poses to a corporation.
Track Changes and Document Revisions
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. The Track Changes feature records the changes (inserted, deleted, and moved text) made to a document during a review. As changes are made with Track Changes turned on, the application keeps a new revision of the document. This revision history exists even after changes to the document have been accepted or rejected.
Track Changes shows the history of changes to the document. If Track Changes is left on but the on-screen highlighting is turned off, every change made to the document still remains. The effect is like a recording of every keystroke made to the document, one that subsequent reviewers can view. Thus, even though the tracked changes are not visible, they still travel with the document and, in some circumstances, can be sent to and seen by an unintended party, with potentially disastrous consequences.
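As a minimal sketch of how easily this history can be read back, the Python below lists tracked insertions and deletions with their authors. It assumes the file is saved in the newer Office Open XML format (.docx, a ZIP package of XML parts), which this article does not itself discuss; the function name `find_tracked_changes` is illustrative only.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx packages (assumed format).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_tracked_changes(docx):
    """Return (kind, author, text) tuples for tracked insertions and deletions.

    `docx` is a path or a binary file-like object holding a .docx package.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    changes = []
    # Inserted text lives in <w:t> under <w:ins>; deleted text in <w:delText> under <w:del>.
    for kind, tag, text_tag in (("inserted", "ins", "t"), ("deleted", "del", "delText")):
        for change in root.iter(f"{{{W}}}{tag}"):
            author = change.get(f"{{{W}}}author", "")
            text = "".join(node.text or "" for node in change.iter(f"{{{W}}}{text_tag}"))
            changes.append((kind, author, text))
    return changes
```

Nothing here depends on the highlighting being visible on screen: a deleted price or clause is returned together with the name of the person who removed it. Entries with empty text can appear when only a paragraph mark was inserted or deleted.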
Comments
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Comments are notes and suggestions that are added to a document via the comment feature to help facilitate an online review.
Comments, like hidden text, can reveal sensitive information to external parties unless they are intentionally removed, because comment metadata travels with the document. Microsoft Excel and Microsoft PowerPoint documents are especially susceptible to this risk, as these applications have no built-in mechanism to warn a user that comments are embedded.
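The following sketch shows how a recipient could pull comments out of a Word file without opening Word. It assumes an Office Open XML .docx package in which comments are stored in the `word/comments.xml` part; Excel and PowerPoint store comments in different parts and are not covered. The name `list_comments` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx packages (assumed format).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_comments(docx):
    """Return (author, text) tuples for every comment stored in a .docx package."""
    with zipfile.ZipFile(docx) as package:
        if "word/comments.xml" not in package.namelist():
            return []  # no comments part, so nothing to report
        root = ET.fromstring(package.read("word/comments.xml"))

    return [
        (
            comment.get(f"{{{W}}}author", ""),
            "".join(node.text or "" for node in comment.iter(f"{{{W}}}t")),
        )
        for comment in root.iter(f"{{{W}}}comment")
    ]
```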
Document Properties
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Document properties are details about a file that help identify it. They include a descriptive title, subject, author, manager, company, category, keywords, comments, and hyperlink base. Document properties help organize files so that they can be easily found at a later date.
The names of authors and the name of the company can reveal sensitive information about a corporation. If a document has circulated outside your own corporation, the author name and company name stored in the built-in properties may be names other than your own. In addition, if documents are repurposed or used as a template for a new document, information that is specific to a previous client, such as pricing, terms, or the client's name, can be stored as hidden information within the new document.
Document Statistics and File Dates
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Document statistics include information on when the document was created, when it was modified, when it was accessed, and when it was printed. In addition, document statistics display the name of the person it was last saved by, the revision number, and the total editing time. Other statistics can include the number of pages, paragraphs, lines, words, and characters.
Document statistics can create embarrassing situations. For example, the "last saved by" metadata shows the last person who edited the document. Repurposing previous documents can reveal a history that you may not want to share with another person or organization.
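The built-in properties and document statistics described above can be read in a few lines. This is a sketch under the assumption that the file is an Office Open XML package (.docx, .xlsx, or .pptx), where such values are conventionally stored in the `docProps/core.xml` and `docProps/app.xml` parts; `read_properties` is an illustrative name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Parts that conventionally hold built-in properties and statistics (assumed layout).
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")


def read_properties(office_file):
    """Collect non-empty built-in properties from an OOXML package as a dict."""
    properties = {}
    with zipfile.ZipFile(office_file) as package:
        names = set(package.namelist())
        for part in PROPERTY_PARTS:
            if part not in names:
                continue
            root = ET.fromstring(package.read(part))
            for element in root:
                # Tags look like "{namespace}creator"; keep only the local name.
                local_name = element.tag.rsplit("}", 1)[-1]
                if element.text and element.text.strip():
                    properties[local_name] = element.text.strip()
    return properties
```

Keys such as `creator`, `lastModifiedBy`, `revision`, `Company`, and `TotalTime` correspond to the author, "last saved by", revision number, company, and total editing time; which keys are present depends on the application that saved the file.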
Document Reviewers
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. The document reviewers metadata is a list of users who have added or accepted tracked changes. When the reviewers' names are removed but the tracked changes are not, the revisions remain with the document, although the user name associated with each revision is removed. It is recommended that the names of the document reviewers be removed whenever tracked changes are removed.
The risk from the Document Reviewers metadata is that it can expose who has previously reviewed the document and who has suggested changes.
Custom Properties
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Custom properties include any property fields added to a document, either manually or by various programs, to help manage and track files. Common custom properties used to identify specific data are DocumentID, department, and status.
Custom properties are normally specific to an organization and may represent proprietary information or competitive business practices. The risk arises because a recipient can easily read them to see a history of the document and learn about internal practices.
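To illustrate how visible custom properties are, the sketch below reads them from an Office Open XML package, assuming they are stored in the conventional `docProps/custom.xml` part. The function name `read_custom_properties` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_custom_properties(office_file):
    """Return a {name: value} dict of custom properties in an OOXML package."""
    with zipfile.ZipFile(office_file) as package:
        if "docProps/custom.xml" not in package.namelist():
            return {}  # the package carries no custom properties part
        root = ET.fromstring(package.read("docProps/custom.xml"))

    properties = {}
    for prop in root:
        name = prop.get("name")
        # The value sits in a typed child element; itertext() reads it whatever the type.
        value = "".join(prop.itertext()).strip()
        if name:
            properties[name] = value
    return properties
```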
Hidden Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Hidden text is text that has been formatted as hidden. Unless the user specifically chooses to display it in Microsoft Word, hidden text is not shown within the document.
Hidden text can contain notes that are particular to a document. If it is not cleansed before the document is shared, hidden text can be viewed by unintended parties.
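A rough sketch of how hidden text can be recovered from a Word file follows. It assumes an Office Open XML .docx package, where hidden formatting applied directly to a run is recorded as a `w:vanish` element in that run's properties. Hidden formatting inherited from a style is not detected, and `find_hidden_text` is an illustrative name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx packages (assumed format).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_hidden_text(docx):
    """Return the text of every run that is directly formatted as hidden."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    hidden = []
    for run in root.iter(f"{{{W}}}r"):
        vanish = run.find(f"{{{W}}}rPr/{{{W}}}vanish")
        if vanish is None:
            continue
        # A bare <w:vanish/> means "on"; an explicit false value switches it off.
        if vanish.get(f"{{{W}}}val", "true") in ("0", "false", "off"):
            continue
        text = "".join(node.text or "" for node in run.iter(f"{{{W}}}t"))
        if text:
            hidden.append(text)
    return hidden
```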
Headers and Footers
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Headers and footers are areas in the top and bottom margins of each page in a document. Text or graphics can be inserted in headers and footers—for example, page numbers, the date, a company logo, the document's title or file name, or the author's name—that are printed at the top or bottom of each page in a document.
Custom headers and footers can contain descriptions such as the filename, path, the date and time the document was modified, or other information that is deemed important to make it easy to retrieve and edit a file. Unfortunately, the information contained in headers and footers is often overlooked when the document is shared. Failure to remove this information can result in revealing confidential information.
Footnotes
Microsoft Word documents only. Footnotes attributed to content are embedded as metadata into Microsoft Word documents.
Footnotes may expose private, internal directions about how the document is used in the organization.
White Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. White text is text that has been formatted with a white font color on a white background. The text appears invisible when viewed or printed and can be used to hide information in a document. This method of blocking out text is often used as a form of redaction.
White text is commonly used when documents are posted to the Internet so that they can be more readily found by search engines, and to hide confidential information in redacted documents. However, white text can still be viewed by external users, for example by selecting it or changing its font color. Depending upon what was actually written as white text, the information can be very damaging. White text can also be used for particular field codes such as the "include text" field code, which can point to a file location. If this file location code is embedded in a document, users can unknowingly update the code and potentially expose the document to a hacker.
Small Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Any text block contained in a document that is less than five (5) points is considered small text. The text is so small that it will not be visible when viewed or printed and can be used to hide information in a document.
Like white text, small text is commonly used to put information in documents so they can be found by search engines. Small text can also include sensitive information that was not meant to be distributed externally.
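White text and small text can both be flagged with a simple scan. The sketch below assumes an Office Open XML .docx package, a white page background, and formatting applied directly to runs: font color is read from `w:color`, and font size from `w:sz`, which is expressed in half-points. The five-point threshold follows the definition above; `find_disguised_text` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx packages (assumed format).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_disguised_text(docx, min_points=5.0):
    """Return (reason, text) tuples for runs in white or in a font below min_points."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    findings = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(node.text or "" for node in run.iter(f"{{{W}}}t"))
        props = run.find(f"{{{W}}}rPr")
        if not text or props is None:
            continue

        color = props.find(f"{{{W}}}color")
        if color is not None and color.get(f"{{{W}}}val", "").upper() == "FFFFFF":
            findings.append(("white", text))

        size = props.find(f"{{{W}}}sz")
        half_points = size.get(f"{{{W}}}val", "") if size is not None else ""
        # w:sz is stored in half-points, so divide by two before comparing.
        if half_points.isdigit() and int(half_points) / 2 < min_points:
            findings.append(("small", text))
    return findings
```

The check is deliberately narrow: it does not examine shading, highlighting, or styles, so text hidden by other color combinations would pass unnoticed.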
Macros
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. If a task is repeated in Microsoft Word, Excel or PowerPoint, it can be automated using a macro. A macro is a series of commands and instructions that are grouped together as a single command to accomplish a task automatically.
There are several reasons to strip out custom macros. First, macros can be set for templates that may have some amount of pre-populated data, and there may be times when the information contained in these templates should not be seen by external audiences. Second, macros can be linked to internal databases or intranets, and the internal file naming structure is generally information that most corporations do not want outside their firewall. Lastly, macros are often quite complex and, if developed in-house, may represent the company's intellectual property. If macros are included in the document, that information is freely shared with any outside party.
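A quick pre-send check for macros might look like the sketch below. It assumes an Office Open XML package (for example .docm or .xlsm) in which the VBA project is stored under its conventional part name, `vbaProject.bin`; a package could use a different name, so an empty result is not proof that no macros exist. The name `find_macro_parts` is illustrative.

```python
import zipfile


def find_macro_parts(office_file):
    """Return the names of package parts that look like embedded VBA projects."""
    with zipfile.ZipFile(office_file) as package:
        return [
            name
            for name in package.namelist()
            if name.lower().endswith("vbaproject.bin")
        ]
```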
Previous Versions
Microsoft Word documents only. Previous versions show the number of times that a document has been versioned over its lifetime. This function enables Microsoft Word to save prior versions of a document as a part of the electronic file.
The risk associated with previous versions is that a recipient can access any of the previous versions that have been saved. Therefore, the party reviewing the document can go back to any version and see what was changed in the document lifecycle. This metadata, while useful in some instances, can disclose sensitive information.
Routing Slips
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Routing slips are used to create a distribution list of reviewers in a particular order. Routing slips are manually created by adding in recipients' email addresses. When a file is routed, it is sent as an attachment in an email message.
Routing slips reveal the names and email addresses of the people to whom the document was sent for review. This may be information that should stay confidential rather than be distributed externally. For example, if a document whose routing slip contains email addresses is published to the Internet, those addresses can be displayed for all to see.
Fast Saves
Microsoft Word documents only. Fast saves is an option in Microsoft Word that saves just the changes that were made to a document, resulting in the history of the changes being saved with the document file. Turning fast saves off and saving the document will remove the changes and store only the final version of the document.
Like other metadata, changes saved during a fast save can expose sensitive information to external parties when viewed using a text or hex editor. Deleted text can still exist in the electronic file. According to the Gartner Group's Research Note on Metadata in Office, "users can easily forget that metadata exists when they send the document to someone else. Some metadata is never visible, such as pieces deleted by users but not really deleted by Microsoft Office when operating with fast save turned on."
Hidden Slides
Microsoft PowerPoint documents only. Hidden slides are slides that are hidden so that they are not shown during a slide show.
A master Microsoft PowerPoint slide deck may contain some slides that are used as backup or that are for internal use only. To prevent these slides from being shown accidentally, it is best to strip out any hidden slides before sending the slide deck out externally.
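Hidden slides can be spotted before a deck is sent out. This sketch assumes an Office Open XML .pptx package in which slides live under `ppt/slides/` and a hidden slide carries `show="0"` on its root element; `find_hidden_slides` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_hidden_slides(pptx):
    """Return the part names of slides marked as hidden in a .pptx package."""
    hidden = []
    with zipfile.ZipFile(pptx) as package:
        for name in sorted(package.namelist()):
            # Skip relationship files and anything outside the slides folder.
            if not (name.startswith("ppt/slides/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            # The attribute defaults to shown when it is absent.
            if root.get("show", "1") in ("0", "false"):
                hidden.append(name)
    return hidden
```

Part names do not necessarily match the order in which slides appear in the presentation, so the result identifies which slide parts are hidden rather than their slide numbers.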
Hyperlinks
Microsoft Word and Microsoft Excel documents only. Documents can contain hyperlinks to other documents or Web pages, which are displayed as blue underlined text. Hyperlinks in Microsoft Excel files can take several forms: a link to a cell in another Microsoft Excel document, a named link to a named reference in another Microsoft Excel document, a link to another document, an OLE link that inserts another document as an icon, and an OLE link that inserts another document as text.
Hyperlinks can maintain a link to a location that a corporation may not wish to disclose, such as files on a computer's local file system, in a corporation's internal database, or on an intranet. Disclosing the file path, or the location where the files are stored, can invite potential hackers to gather sensitive corporate information.
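As a final sketch, the code below lists every external link target recorded in a file, which is where internal paths and intranet addresses tend to surface. It assumes an Office Open XML package, where links to outside resources are stored in `.rels` parts as relationships with `TargetMode="External"`. The function name `find_external_targets` is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts in an OOXML package (assumed format).
REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_external_targets(office_file):
    """Return (relationship part, target) tuples for links that leave the package."""
    targets = []
    with zipfile.ZipFile(office_file) as package:
        for name in sorted(package.namelist()):
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(f"{{{REL}}}Relationship"):
                # Internal relationships point at other parts and are not of interest here.
                if rel.get("TargetMode") == "External":
                    targets.append((name, rel.get("Target", "")))
    return targets
```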