Could PDF be a weapon in cyber warfare?

(upbeat music) – So my name is Peter Wyatt
and I’m here to present the next talk with Duff so
we’re gonna be going full blown Duff will take over at the end. So a little bit about
myself and about Duff. I’m a board member of the PDF Association and I also chair the PDF TWG,
the Technical Working Group that the PDF Association runs. And also what you’ll be learning
about through my talk today is we’ll be standing up a new
TWG called the SafeDocs TWG. I’m also the Co-Project Leader with Duff on ISO 32000 which is the
core PDF standard, as you heard from Dietrich in
the last presentation. And for the purpose of this talk I’m also the PDF Principal Investigator on the DARPA-funded project
called Safe Documents or what I’ll call today the SafeDocs. And I come from a software R
and D engineering background so a technical background. Duff here you mostly know, I won’t go through his credentials but just to say that Duff
does chair other TWGs. And he’ll be the Industry Lead on the DARPA SafeDocs PDF project. So now you’re probably
asking, what is SafeDocs? So let’s look at what DARPA has done. So DARPA is the U.S. Defense Advanced
Research Projects Agency. They’re a big group that
looks at long-term research or deep problems and they
invest heavily in this area. So this is a document
that they have made public and I’ll just read it out. So they wish to reduce
electronic document complexity and I think a lot of us
in the room would agree that would be a good thing
when it comes to PDF. And build verified parsers,
so verified meaning that we can prove things,
to radically improve software’s ability to reject
invalid and malicious data. So again that would all be a good thing everyone in this room would agree. And a goal they set themselves all is to regain trust in electronic documents and the ability to process them safely. So a thing they say, regain
trust, when did we lose it? I wasn’t aware that we’d lost
trust in electronic documents. So we need to take a look at this not wearing our PDF hats but wearing our cyber security hats. So I want you to put on your hoodies, put on your dark sunglasses, and we’re gonna take a look at PDF from a cyber security point of view. But before we do that let’s
look at the actual data that supports the assertion
that PDF is maybe unsafe. So simple data from our midst
regarding vulnerabilities. If you just do a simple
term search for the word PDF this is the data you get as
of this morning or last night. So you can see from the
graph on the left there the total matches by year. So this is generally speaking,
it’s an upward trend. We’re halfway through
this year so we can expect that by the end of this year
we’ll be somewhere up here. And that basically says the
total number of vulnerabilities. The world is getting a nastier place and if you go on the internet
I think we’d all agree that’s probably true there’s a
lot more nastiness out there. But if you look at it as a percent by year which is this graph here on the right, you’ll see that generally speaking we’re in the sort of three to 5% band which is this band here. So what do you call this kind of tracking in terms of standard vulnerabilities, in terms of PDF documents? So remember this is also out
of 117,000 almost 118,000 vulnerabilities that have
been reported since 1996. So I don’t think that’s
too bad when you consider that we’re the most ubiquitous
platform on the planet. Yesterday we were hearing
that there’s maybe a trillion PDFs produced per annum. So at three to 5% that
doesn’t seem too bad. But you do have to consider
this data is a simple search, I only searched for the word PDF. So it doesn’t include all the
scams and all the phishing or the annoying emails,
they’re not malicious in terms of the file
itself, it’s not malicious, they’re obviously trying
to get you to click through to something that
maybe is a bit malicious. But also it doesn’t
include the nested files. So this is just the search for PDF. So maybe things are a
little bit worse than this but still for the most dominant
file format on the planet I don’t think this is too bad. But what does it really
mean when you dive into it? It’s probably a little bit
hard to see this graph. This is taken from a
conference here in the U.S. of a few weeks ago about the economics of information security. So these researchers looked
at all the vulnerabilities over about a five or six-year period which is the gray circle,
the outside circle. Then they looked at
the published exploits. So these are where people have
discovered a vulnerability and that vulnerability has then been found in exploit toolkits that the attackers use to make nasty stuff. And believe it or not you can go to GitHub and you can download this stuff. So that’s what the blue circle is and as you can see it’s
a very small percentage. And then they looked at the red circle. The red circle is the ones that based on firewall intrusion data so
where security researchers are monitoring what’s
happening through firewalls. This is where they noticed
that that particular exploit actually made it into a corporation. So again, we’re shrinking
down from a big picture of three to 5% to
a much smaller picture. So the data really isn’t that bad. If we then look at VirusTotal, so VirusTotal is a service
used by cyber security people and it has about 70 or
more, 70, 80 antivirus products where you can upload a document or a file and you’ll get the
opinion of these 70 or 80 antivirus solutions to
tell you whether or not a file is malicious. Now not all the products support
PDF but a large number do. And in 2017 it was
reported that VirusTotal received over 12 million non-executable, meaning document, submissions per year. Now that obviously includes
the Office formats as well and you would all be aware of macro viruses in Office products. So again there’s still
a lot of files out there that doesn’t mean there’s
a lot of variation in them. And that’s primarily
because of the pink circle that we have above. So why did DARPA do this? Why have DARPA funded a project to look at document security from a cyber security point of view? And it really comes to this graph here. So they did a detailed
analysis back in early 2018 of all the CVEs, all the data that people have publicly published
about cyber threats. And what they found was
that the vast majority of these threats come from
parsing vulnerabilities. Now most of you who have read the paper may even be aware of
vulnerabilities such as Pathly. Now that really has nothing to do with PDF except that the root cause of that problem was an input vulnerability. The parsing of the data
stream wasn’t checked properly and that allowed somebody
to take advantage of that and then do an exploit. So the summary of this is
really be alert not alarmed. And parsing errors are the root cause of almost all vulnerabilities. And that’s why DARPA has
looked at funding a project regarding improving the
trust in our documents meaning making them more
robust to vulnerabilities. So how do cyber security
researchers think about PDF? So we have to change our
mindset here we need to think rather than thinking about
all those great things that PDF can do we need to
put on our dark sunglasses, we need to think about all the
ways that we can get a victim to do something that I want you to do. Click on my website, download
some malicious software, visit my nasty website
whatever it might be. So we’re gonna go through each
of these and discuss them. And we’ll talk a little bit
about how you as PDF developers are already addressing
this which I believe we are in a lot of cases
but also what we can do with DARPA to improve
the situation for us all. So the first thing I wanna talk about is just the usual spam phishing. So this classification is
not malicious documents. These are just documents that you get sent you receive in your
email or some other way and they’re really designed
for you to click on that URL and then visit somewhere. So the document itself is not unsafe it’s just a volume-based approach. Most of these spam phishing emails come from Office products. They’re mass produced as in one file is sent to many millions of people. And the goal as I said is to click on it. So the way that most
products nowadays work is that before you do any
action, your user is informed, do you wish to visit the website? And even if the thing on the screen says you’re going to visit,
I won’t pick on anybody and say, it’ll
actually then show the URL so you know you’re visiting and not somebody else’s website. And we’d all be familiar with
this from our experiences with browsing, lots of
browsers now show you the hint that this is the URL
that you’re clicking on. So we already had in
place well-established procedures to handle the
spam and the phishing. The next theory we should look at are the more exploitive documents. So I’ve got three types here, payload, denial of service and zero-day. So a payload is basically a PDF file that contains something malicious. And we all know about
the file attachments, the executables, the other
things that can be attached to a PDF file and the
trick there is the PDF file is trying to trick you
to open this let’s say, a free ticket or something, it’s gonna get you to do something. In a denial of service attack, this is more a payload that’s relying on crashing a PDF service
often a web service to disrupt an organization. So it’s mainly associated with disrupting an automated service and we’ve heard a lot during the last two days
about automated services. And a zero-day exploit is
basically where the PDF file is trying to attack the operating system or the host environment. So what they’re trying to
do is to get that PDF file to attack and exploit
the operating system. So in all these cases, there is usually some
kind of programming code in the PDF file and in our
case that’s JavaScript. So that doesn’t mean that
JavaScript is nasty or evil in actual fact we would all
appreciate that JavaScript is an important part of
our business model for PDF. We need it for smart forms,
we need it for 3D PDF and we need it for many
other applications. But unfortunately a lot
of these attack methods do rely on JavaScript. So we’ll move on to social engineering a lot of you have
probably heard this term. So what does social engineering
mean in a PDF sense? It’s really about understanding
you as an organization so it’s about targeted attacks. They’re trying to learn more
about your customer or you or your organization or your processes so they can build an attack,
a more structured attack. And here’s just one example that came up just a few days ago. So this is an attack from Qatargas, one of the largest LNG
producers in the world, and somehow somebody got
into their HR website. So this is like an opt-in virus. You go to the HR website
and you apply for a job, unfortunately you download a
PDF that’s been tampered with. That PDF then has a vulnerability in it which then feeds forward and you lose some of your credentials
and before you know it you’ve got a Word document
with a massive virus pile-on. So PDF here is a step on,
it’s actually not that bad. But the whole point here is that they are social engineering
you through an opt-in process. You’ve elected to apply for a job and before you know it you’re infected. So this is a bit different to spam. Spam is when it comes to you unsolicited. This is where you’ve quite
honestly gone to a website and downloaded the PDF. When you actually upload
this exact PDF to VirusTotal this is the answer you
get so you can clearly see that there are a number
of PDF-aware technologies. They have a red cross saying
yes this is a nasty PDF. What you can’t see is the, I think I had the number
up here somewhere, 19 out of 59 products
detected it as a nasty PDF and the other ones didn’t. So this does indicate
that there’s certainly a lot of catch-up in the antivirus world in terms of detecting PDF payloads. In terms of misinformation and mistrust. So this is a new area, so this again about
fooling users into thinking that rather than an attack
that do you trust this document because something about the document it’s got a digital signature from somebody that you recognize or that you trust. That it’s got information that you believe to be true and accurate and therefore you’ll enter it into
your financial systems. And when you think about it
like that this is something just talking about trust it
doesn’t have to be electronic trust it can just be that you trust the person who sent you the email. The fact that it’s professionally
produced and you trust it. So thinking about the documents that you interact with
everyday of your life. Think about why do you
trust that document? How do you know about that trust? So some recent research out of Germany, it took this trust to a different level. They analyzed digital signatures and they found that they could actually create flaws in the digital signature. Now this is not to say that
digital signatures are weak, it’s got to do with the implementation of digital signatures. So they tested three different attacks making illegal PDFs now the key thing here is these PDFs are actually
technically speaking illegal. Invalid I should say not
illegal, they’re invalid. And we had 21 out of 22 desktop viewers and five out of seven
online validation services did not detect that the tampered files had been tampered with. So when we look at the analysis
of this what do we find? We find the fact that the complexity of the PDF specification, Dietrich in the last talk spoke
about the 1000-page PDF spec but when you think about it,
it’s not just 1000 pages of PDF it’s all the nested formats as well. So there were many nested
formats, the true types, the imaging formats, the font formats, all these other things come into play in making a PDF product. And the other part to this is what we’ve all spoken
about especially yesterday in Stream out here was
the tolerance to errors, the permissiveness. We all talk about the issues
that we face as developers where we get PDFs and
they’re not quite valid and we have to change our implementation because our customers
demand that it works. And these are the two things that this particular attack utilized. They were utilizing the
tolerance to errors, the fact that the PDF file was invalid. And the complexity of the PDF specification, that not everyone who’s writing
digital signature handlers has actually probably read
all the specifications. And in the case of this particular case that had signatures
would completely bypass all these attack methods. But unfortunately they’re very complicated and if you’re not experts
then you will struggle to understand them. So moving forward, the
next one is polyglot now this is a very different situation. So this is a single file
that is simultaneously multiple valid formats. So effectively you can
rename the extension .PDF and open it in a PDF
viewer, rename it .JPEG and open it in a JPEG image viewer and rename it .HTML and open as a webpage. So as you can imagine that’s a rather complicated thing to do. There is a particular
researcher Ange Albertini who made a habit of creating these files as a technical challenge to himself. And this is just an example
of exactly what I described. A single PDF file that’s PDF HTML in this particular case a ZIP file and you just have to rename it. So you might think this is great. This is some great PDF magic if you will, it’s very interesting,
how can it be a threat? Well we go back to the
discussion we just had about antivirus solutions. If your antivirus firewall
detects a file coming in maybe it’s got a .HTML on the end, looks at the file and goes,
yep, that’s valid HTML and lets it through the firewall. And the next bit of software
inside your organization then says, let me sniff the
file, let me check it out, and it decides it’s a PDF,
so they’ve checked for viruses as HTML but it’s really a PDF file. And this goes to the point of the sniffing function,
I’m sure most of you would be aware of that, which
means checking it very quickly, taking a quick look at it, not
doing a proper analysis of it. And that’s really the flaw that this kind of file points out. We all have disparate IT systems and disparate software
in our organizations where we don’t necessarily
connect all the dots. They’re just pieces put together and there are ways through the
gaps in between the pieces. Content masking, so content
masking is a new way and we’ve spoken a little bit about that over the last few days. In particular this is
about masking content where the content that you see as a human, the rendered content, maybe does not match the extracted content that your machine and your automated
processes are working with. And there’s been some
research done in this area. So this particular group demonstrated three successful invisible attacks. So the first one was
altering academic papers to fool automatic reviewer
assistants and game the system. Obviously they had to know the algorithm to manage to game them. They also were able to modify the PDFs. These were completely valid PDFs to avoid plagiarism detection
by the Turnitin tool used by many universities and schools. And lastly which is probably
a more interesting one is they were able to bias
search engine results from a few of the big
search engine providers with information not
visible in the content. And if you think about the
HTML world, the web world, we remember a number of years ago when people were gaming the system and trying to get up
that Google search rank by putting all manner of keywords into the top of their HTML page then no matter what you
searched for it always came up with that website that had nothing to do with your search term and
this is the equivalent. We’re using hidden text,
we’re using invisible text, we’re using white text, we’re using text that’s off the page, we’re using other mechanisms, content streams that
actually don’t get called or don’t get referenced,
so there’s lots of ways that they’re creating
to fool these systems. So what it indicates to us I think is that we need to be careful for fully automated
systems for our customers. That it can be possible an
automated decision tree will make a decision on content that
it really shouldn’t do. So off-page content,
invisible content, white text. There’s already tools
and many PDF products to strip out this which
we spoke about, redaction. So there’s lots of ways
that we already have to clean this and to sanitize documents. But it does indicate that
if we don’t think about this kind of thing we can
create issues for our customers. UI forgery. Now this really isn’t a technical issue but it is unfortunately
a growing area of scams. And this is a subtle difference
between a real signature in a product that’s a valid document that is signed properly. This is a document that has an
image of that signature panel and unfortunately this
fools many, many users. This is really a UX issue in my book, we can’t as a technical specification control what a document looks like but we can have our UX engineers, our user experience engineers,
work with our engineers to define better products. We only have to think
back in the web world about the green padlock in our browsers. Five, 10 years ago that
green padlock didn’t exist. And then the green padlock came along and now we’ve all been trained
without any formal training to look for that green
padlock when we buy something and if it’s not green we get very nervous and we stop what we’re doing. But this is an example where a document is trying to fool us into accepting that this content is actually valid, in this particular case
we have a real website and a PDF that’s mimicking the website. This is obviously a more complex form than your typical bit of spam email. But again many users fall for it: it looks exactly like
my website, maybe I’m expecting
something from City Bank. So they jump in and they
put in their username and password and press a submit button. And as you can guess that submit button doesn’t go to the website
it goes somewhere else. So again UI forgery is
about training I think, training our users to understand and be a little bit wary of documents but also up to us as
PDF vendors to make sure that our user experience is helping them get the best experience in PDF. Privacy leakage. Again we’ve heard a lot about redaction. So this is where information leaks out of our organization through
either accidental processes like these classic examples
of redaction discussed earlier or maybe just the metadata in files. I used to work for an imaging company and a lot of photos and images that are in marketing brochures, you can inspect those and
work out what model of camera you’ve got,
what serial number it is, maybe where the photo was
taken from the geolocation and the image data,
there’s a lot of metadata that can be in files. If you don’t realize that, your
customers don’t realize that. They could be telling
you a lot of information. It’s also surprising
the number of documents that without knowing an
organization you can find out people’s names, people’s positions. Maybe their phone numbers
and contact details because a tool has
automatically added metadata through a workflow with all the people’s names and positions. And this is useful for
attackers in social engineering to work out how to maybe do
a social engineering attack in your environment. And lastly, maybe a bit
of a different viewpoint is information hiding. So up until now we’ve been speaking really about PDFs coming
into our organization and about how we might manage the threat that that might pose. But this however is
about information flowing outside your organization. In a particular context
where you have somebody who wishes to leak information
so that’s very specific certainly new strings. So in technical terms this is steganography where for example you
might encode certain images with metadata, buried in the pixel data, you can’t really see
anything, send it out. In my previous role I
did work with certain defense oriented governments
that were very concerned with this actually car
manufacturers were also concerned with this as well. So information hiding is not so much that you have an attacker
trying to attack you, you have an attacker internally trying to leak information
or move information outside your organization. But it is something where
PDF is actually used for this but it’s not really
something again I think that we can do as PDF vendors. So that’s what we’ve been thinking about, is how cyber security
researchers think about PDF. This is not how we as PDF think about PDF. So why has SafeDocs come around? Duff are you gonna take over or do you want me to keep going? – So I’m here I guess to
take the slings and arrows with industry with
regard to our connection and our association’s (mic buzzing) work with this project, with this program. DARPA is asking the question
how can we guarantee a document’s information is safe, trustworthy and authentic? I’m gonna shut this down. So and of course we’re
not just talking about PDF we’re talking about PDF
and its nested formats. And that’s really the
main reason why DARPA has targeted PDF, it’s a container format. They theorize that if they
can prove their methodologies in terms of generating
something that’s radically more secure, radically less complex and thus presents a smaller attack surface while not losing any of its functionality then they’ve achieved something. So DARPA’s goal is no
less than to regain trust in electronic documents and the ability to process them safely. They’re inclined to do
this with the development of verified parsers to radically improve the software’s ability to identify and reject invalid and malicious data, as Dr. Bratus, the Program Manager, has stated. So our objectives, the PDF Association in the context of SafeDocs
are fairly straightforward. First and foremost the
Board and the Directors and the PDF Association authorized our involvement in this
because we want to ensure a positive result for the
PDF technology ecosystem. We have to fundamentally do no harm. And that means ensuring
that the researchers and the parser developers
don’t head off down directions that are fundamentally inimical to the interests of the
users and the developers of PDF technologies. We want this to be safe
for the various types of vertical markets that PDF
plays into and the various features that are critical to
PDF industry developers. So we want to ensure a positive result for the PDF ecosystem
fundamentally by retaining, ensuring that the core
value proposition PDF is respected in the program. The program centric objectives are broader in the sense that we wanna
maximize for these researchers and there are a variety
of specialists in both as it were offense and defense in terms of software security. Or maximize the likelihood
that they’ll actually achieve something useful
in their investigations of PDF technology and in
their attempts to build technologies that achieve
some of the objectives in terms of securing PDF. We want to leverage the
tech based development that those guys do in terms of identifying perhaps other approaches
that are deployable in industry context and
assist the researchers not only in their research
but also in their ideas about how they can produce transitionable code and other kinds of resources that will be useful in that context. I keep hearing a strange noise. And so finally there are artifacts that are potentially used in the industry that may indeed fall out of this research. And among these are of course
the straightforward objectives of the program include things
like parser construction and toolkits but also very importantly this work is gonna be conducted
in an open source manner. They intend to develop
large corpora of documents and probably more importantly to publish specifications and data
about these documents to enable people to be
able to characterize this information and come to
draw their own conclusions about patterns that they
see in this content. So the takeaway is (coughing) So the message really here is that PDF is a very prevalent technology and it is a rich technology
so the combination makes it attractive to people who are trying to do mean
things with computers. And it makes it attractive
from the point of view that there’s a reasonable
scale of attack surface to PDF. So to prove the point of
assisting the government in its interest in
securing this technology further and making PDF even
safer than it is today. We felt that it was a good
idea to get involved with it and to enhance the likelihood
of positive outcomes for this research. So DARPA’s SafeDocs program is gonna study this project in new ways
and we’re involved in it to advise and to inform that effort to hopefully engage the industry,
get the industry’s input. Putting it as the SafeDocs
institute and counter and address the subjects
of real interest to it. The concept for example
of a safe subset of PDF whatever that may be. And the various tooling
necessary to find ways of taking advantage of
that and the relevant or immediate products that
may come off of that effort. So to that end the PDF
Association is going to be as Peter has mentioned operating
a Technical Working Group that will have as its role
to supervise in a sense some of the conclusions and
strategies that are generated by this effort and to be
able to give industry a means of feeding back their opinions and input and thoughts through Peter, myself and directly into the SafeDocs program. Trying to inform the program to ensure that it generates useful
outcomes and of course to ensure that it doesn’t use
something or produce something that is inimical to the interests of the PDF technology industry and of course the consumers who depend
upon PDF everyday. Any questions? – Is there a timeline for this stuff? – So SafeDocs has just had its kickoff. The Phase 1 program will
run for the next 18 months that’s the primary research phase. If it’s considered to be successful it’ll be followed by another
18-month program Phase 2. Which will include the
development of parsers that result from the research in Phase 1. There’s a Phase 3 that
can proceed beyond that which they call transition
and the objective of Phase 3 is after the principal
parsers have been developed in Phase 2, Phase 3 is all about enabling transition to this technology
and making it available and promoting it in fact
to various implementers. Those are the straightforward
defense interests and there are several of
those already involved in the program and then of course larger industry concerns as well. – So you say these parsers
are gonna be open source? – Yes, so all the source
code, all the corpora, everything that DARPA does is open source so it will be available. – Do you know which license? – No is the answer to that question. I think we don’t know the license hasn’t really been discussed. – They’ve spoken of it only
in maximally open terms. – So are we gonna have certified viewers? – They are not prescriptive
about the types of solutions that they envision but you may use your own imagination and you may indeed come up with that idea. It’s certainly a perfectly plausible idea. – Yeah cause it seems like
a lot of the vulnerabilities aren’t really PDF as much as they are– – Open sessions. – It’s all about parser
so technically speaking the method they’re
using is called language theoretic security and what that means is really defining a formal
grammar if you will for PDF. This is a very high level hand-wave, so don’t take it too concretely. So that we can mathematically prove that when you write your parser there’s no other input that you can accept except perfectly valid PDF,
maybe some PDF safe subset. That’s the underlying principle,
but the reality’s gonna kick in very quickly. (audience laughing) – It already is so– – Yeah.
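As a rough illustration of the language-theoretic idea described here, the sketch below writes down a tiny formal grammar and a recognizer that rejects everything outside it. The grammar (nested arrays of integers), the names `parse_strict` and `MAX_DEPTH`, and the nesting limit are all assumptions made for this example. Real PDF syntax is vastly larger, and this is not the SafeDocs tooling.

```python
import re

# Toy grammar, an assumption for illustration only (real PDF is far larger):
#   value   := integer | array
#   array   := "[" value* "]"
#   integer := optional sign and digits, ended by whitespace, a bracket, or end of input
TOKEN = re.compile(rb"\s*(\[|\]|[+-]?\d+(?=[\s\[\]]|$))")
MAX_DEPTH = 32  # hypothetical nesting limit so hostile input cannot exhaust the stack


def tokenize(data: bytes) -> list:
    """Split the input into tokens, refusing any byte the grammar does not allow."""
    data = data.rstrip()
    tokens, pos = [], 0
    while pos < len(data):
        match = TOKEN.match(data, pos)
        if match is None:
            raise ValueError(f"input outside the grammar at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_value(tokens, i, depth=0):
    """Parse one value starting at tokens[i]; return (value, next index)."""
    if depth > MAX_DEPTH:
        raise ValueError("nesting too deep")
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[i]
    if token == b"]":
        raise ValueError("unexpected ']'")
    if token != b"[":
        return int(token), i + 1
    items, i = [], i + 1
    while i < len(tokens) and tokens[i] != b"]":
        item, i = parse_value(tokens, i, depth + 1)
        items.append(item)
    if i >= len(tokens):
        raise ValueError("unterminated array")
    return items, i + 1


def parse_strict(data: bytes):
    """Accept exactly one value and nothing else; no repair, no guessing."""
    tokens = tokenize(data)
    value, end = parse_value(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing data after the value")
    return value


print(parse_strict(b"[1 [2 -3] 4]"))  # prints [1, [2, -3], 4]
```

The point of the design is that there is no tolerant path: truncated arrays, trailing bytes, and over-deep nesting all raise an error instead of being repaired, which is exactly where it collides with the real-world files discussed next.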
– Exactly. – And you know we’ve
discussed this only yesterday I think with the fact
that we all appreciate that those are many PDFs out in the wild that don’t really comply but in reality we have to support. – So our job is really
as much as anything else is to make sure that those
researchers appreciate the scope of what they’re
actually trying to engage with and how their success or their outcome will be largely dependent
on their capacity to absorb and accommodate that scope. – Right and you can’t get
everything either I mean, you still have to deal with
the don’t push this button. – Yeah so SafeDocs is not doing
every single vulnerability but if you’re like this is the map and they’re attacking certain things and DARPA has other
programs that we’re not involved with, for example the deep fakes that you should all be familiar with and you might think about them. What does a deep fake mean
in a document context? And that’s really about
trusting that document you’re reading in front of you and the belief that it’s
real and true and factual and maybe this work will
go into that later on. So DARPA is certainly
supporting a lot of work in the cyber security space
and this is just one area. Yes, Duff? – Just going to address
that four letter word that begins with an F for
font which is a subset of the PDF specification. Because you can actually
get viruses in that as well. – In my presentation I
obviously didn’t wanna pick on any vendors but we’ve had issues with buffer overruns in image handling, buffer
overruns in 3D content, buffer overruns in font handling. It’s just all of the above. So yes it does and that’s
why specifically Duff and I stressed the nested formats,
this isn’t just about PDF, it’s about everything that we
drag in through definitions. – What’s the size of the set of resources that DARPA’s ranging for
recording for SafeDocs either in terms of
researchers, people, money, pick a metric? – So we don’t know, nobody’s told us what the total allocation is. What I can tell you is
what has been estimated to us by savvy people some of whom are associated with the project that typical DARPA program of this ilk might get about 40, $50 million in research funds in its life. – So we had a kick-off I
think it was 53 attendees but there was one big
group that wasn’t present so I’m imagining there are 60, 65 people. But then it also cascades
out into universities and various other places who
will be second-level support so it’s a very big program very attentive and they really want to move forward fast. Any other questions? Thank you
– Thank you. (audience applauding) (upbeat music)
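
The polyglot problem described in the talk, where one scanner treats a file as HTML and the next tool treats it as PDF, can be sketched as a check that collects every format a file appears to match instead of stopping at the first hit. This is a minimal sketch under stated assumptions: the function names are hypothetical, the signature tests are simplified heuristics rather than real parsers, and the 1024-byte window reflects the leniency some PDF viewers are known to show about where the `%PDF-` header may appear.

```python
# Simplified signature heuristics (assumptions for illustration, not a real scanner).
PDF_HEADER_WINDOW = 1024      # some viewers accept "%PDF-" anywhere in this window
ZIP_EOCD_WINDOW = 22 + 65535  # end-of-central-directory record plus maximum comment


def sniff_formats(data: bytes) -> set:
    """Return every format whose signature appears in the file."""
    found = set()
    head = data[:PDF_HEADER_WINDOW]
    if b"%PDF-" in head:
        found.add("pdf")
    if data.startswith(b"\xff\xd8\xff"):
        found.add("jpeg")
    if b"<html" in head.lower():
        found.add("html")
    if b"PK\x05\x06" in data[-ZIP_EOCD_WINDOW:]:
        found.add("zip")
    return found


def is_polyglot_suspect(data: bytes) -> bool:
    """Flag files that match more than one format rather than trusting the first match."""
    return len(sniff_formats(data)) > 1


sample = b"%PDF-1.7\n<html><body>hello</body></html>\n%%EOF\nPK\x05\x06" + b"\x00" * 18
print(sorted(sniff_formats(sample)))  # prints ['html', 'pdf', 'zip']
print(is_polyglot_suspect(sample))    # prints True
```

A file that matches several signatures is not necessarily malicious, but routing it to a proper analysis instead of trusting the extension or a single quick sniff closes the gap between disparate tools that the talk points to.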
