Data Mining & Intelligence Agencies

Moderators: DrVolin, Elvis, Jeff

Re: Data Mining & Intelligence Agencies

Postby General Patton » Tue Sep 25, 2012 12:02 pm

None of the articles mentioned that Lockheed has bought one of D-Wave's quantum computers.

It's likely that the NSA data center in Utah has one. If you read between the lines on Bamford's post about the new data center, you'd be wondering how in the fuck you would process yottabytes or make breakthroughs against current crypto protocols. The chip isn't faster than supercomputers per se, but it can work on problems that would take supercomputers very long stretches of time to compute.

If you wanted to speculate, they could build something like Wolfram Alpha or Watson, except related to intelligence analysis and closer to artificial general intelligence, if not the real deal.

This is riding an exponential curve like most of IT, and they've solved some of the scaling problems, so quantum technology will likely be on the consumer market in 5 years or less.

Oh AES, we barely knew thee. Edit: Or maybe not, depending on the architecture of the computer. The D-Wave isn't a general architecture system; it's designed for optimization problems such as the travelling salesman problem: ... an_problem

And it still appears to compute with a good deal of noise. IBM has made attempts on Shor's algorithm, but it remains to be seen if it would be practical with the amount of data the NSA has.
Lockheed Martin Corporation has agreed to purchase the first D-Wave One quantum computing system from D-Wave Systems Inc., according to D-Wave spokesperson Ann Gibbon.

Lockheed Martin plans to use this “quantum annealing processor” for some of Lockheed Martin’s “most challenging computation problems,” according to a D-Wave statement.

D-Wave computing systems address combinatorial optimization problems that are “hard for traditional methods to solve in a cost-effective amount of time.”

These include software verification and validation, financial risk analysis, affinity mapping and sentiment analysis, object recognition in images, medical imaging classification, compressed sensing, and bioinformatics.


LONDON – Quantum computing has been brought a step closer to mass production by a research team led by scientists from the University of Bristol that has made a transition from using glass to silicon.

The Bristol team has been demonstrating quantum photonic effects in glass waveguides for a number of years but the use of a silicon chip to demonstrate photonic quantum mechanical effects such as superposition and entanglement, has the advantage of being a match to contemporary high volume manufacturing methods, the team claimed.

This could allow the creation of hybrid circuits that mix conventional electronic and photonic circuitry with a quantum circuit for applications such as secure communications.

It's being used in other fields as well, for instance to solve the infamous protein folding issue: ... -used.html
A team of Harvard University researchers, led by Professor Alan Aspuru-Guzik, have used D-Wave's adiabatic quantum computer to solve a protein folding problem. The researchers ran instances of a lattice protein folding model, known as the Miyazawa-Jernigan model, on a D-Wave One quantum computer.

The research used 81 qubits and got the correct answer 13 times out of 10,000. However, these kinds of problems usually have simple verification to determine the quality of the answer. So it cut down the search space from a huge number to 10,000. D-Wave has been working on a 512-qubit chip for the last 10 months. The adiabatic chip does not have a predetermined speedup from adding qubits; the speedup depends on what is being solved, but in general a larger number of qubits will translate into better speed and larger problems that can be solved. I interviewed the CTO of D-Wave Systems (Geordie Rose) back in December 2011. Usually the system is not yet faster than regular supercomputers (and often not faster than a desktop computer) for the 128-qubit chip, but it could be for some problems with the 512-qubit chip and should definitely be faster for many problems with an anticipated 2048-qubit chip. However, the D-Wave system can run other kinds of algorithms and solutions which can do things that regular computers cannot. The system was used by Google to train image recognition systems to remove outliers in an automated way.
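To put numbers on the 13-in-10,000 figure quoted above: when a candidate answer is cheap to verify, what matters is the chance that at least one run in a batch hits. A rough Python sketch, assuming the runs are independent (the quote doesn't say they are); the function names are made up for illustration:

```python
import math

def p_at_least_one(p_single: float, runs: int) -> float:
    """Chance that at least one of `runs` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** runs

def runs_needed(p_single: float, target: float) -> int:
    """Smallest number of independent runs giving at least `target` success chance."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))

p = 13 / 10_000  # per-run success rate reported in the quote

print(f"chance of a hit in 10,000 runs: {p_at_least_one(p, 10_000):.6f}")
print(f"runs needed for a 99% chance:   {runs_needed(p, 0.99)}")
```

Under that independence assumption a low per-run hit rate is not a problem by itself, as long as every candidate can be checked quickly.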


We already have an open source program to infer mathematical patterns from data. Another interesting trend for the next 5 years: combine sensor nets with unsupervised learning algorithms and the ability to crack current cryptography standards, and you can watch everything everywhere. A system like that could record, monitor, and process data and formulate concepts beyond any level we can imagine. ... /chong.pdf

edit 2:
Here's a comment from a PhD candidate in Physics on the D-Wave computer, though it mentions the older version:

D-Wave's quantum computer is an adiabatic quantum computer designed to solve optimization problems, not perform universal computations. Its architecture is not compatible with running algorithms based on the circuit model, which include all the fabled cryptography-beating algorithms based on fast factoring (Shor's algorithm).

In any case, as Michael points out, 128 qubits is certainly not enough to decrypt traditional cryptosystems, and there is some dispute about exactly how "quantum" their computer really is, although their Nature paper has alleviated some of these concerns. At this point, D-Wave's computer is more relevant as a proof of principle than as an actual computational device. Lockheed Martin probably bought theirs to ensure they will be on the ground floor if this thing takes off.

There is a lot of back and forth because of previously hyped-up claims. It may or may not be possible to implement a more general architecture with less noise.
Last edited by General Patton on Tue Sep 25, 2012 5:50 pm, edited 2 times in total.
Penal battalion, forward
General Patton
Posts: 958
Joined: Thu Nov 16, 2006 11:57 am

Re: Data Mining & Intelligence Agencies

Postby General Patton » Tue Sep 25, 2012 5:13 pm

If they have recognized and defined the simple inputs that create a complex system, e.g. a nation, then you could go about simulating it properly, without needing to be guided by current data as much.

One of the problems I ran into, while trying to figure out ways to plug psychology theories into agent-based modeling (ABM), is that we cannot gather data in real time from actual emotional states. That would seem to be a better method, and some advertising firms have started using sensors on focus groups for that purpose.

What kind of sensors would exist in our imaginary future sensor net?

Heartbeat, brainwave patterns, cortisol levels, skin conductance, pupil dilation, possibly some others I have missed. The sensors are likely somewhat redundant so we may only need a few to graph a current emotional state. In some cases the sensors exist but the range is much shorter than would be ideal.
Tests in 2010 showed that the best algorithms can pick someone out in a pool of 1.6 million mugshots 92 per cent of the time. It’s possible to match a mugshot to a photo of a person who isn’t looking at the camera too. Algorithms such as one developed by Marios Savvides’s lab at Carnegie Mellon can analyse features of a front and side view set of mugshots, create a 3D model of the face, rotate it as much as 70 degrees to match the angle of the face in the photo, and then match the new 2D image with a fairly high degree of accuracy. The most difficult faces to match are those in low light. Merging photos from visible and infrared spectra can sharpen these images, but infrared cameras are still very expensive.

Of course, it is easier to match up posed images and the FBI has already partnered with issuers of state drivers’ licences for photo comparison.
Everyone knows Moore’s Law – a prediction made in 1965 by Intel co-founder Gordon Moore that the density of transistors in integrated circuits would continue to double every 1 to 2 years. (…) Even more remarkable – and even less widely understood – is that in many areas, performance gains due to improvements in algorithms have vastly exceeded even the dramatic performance gains due to increased processor speed.

The algorithms that we use today for speech recognition, for natural language translation, for chess playing, for logistics planning, have evolved remarkably in the past decade. It’s difficult to quantify the improvement, though, because it is as much in the realm of quality as of execution time.

In the field of numerical algorithms, however, the improvement can be quantified. Here is just one example, provided by Professor Martin Grötschel of Konrad-Zuse-Zentrum für Informationstechnik Berlin. Grötschel, an expert in optimization, observes that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later – in 2003 – this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms! Grötschel also cites an algorithmic improvement of roughly 30,000 for mixed integer programming between 1991 and 2008.
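The figures in the Grötschel example can be checked with a few lines of arithmetic. A quick Python sketch, assuming a 365.25-day year (the quote only says "82 years" and "roughly 1 minute"):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

# 82 years in 1988 versus roughly 1 minute in 2003
total_speedup = 82 * MINUTES_PER_YEAR / 1.0

hardware_factor = 1_000      # from the quote: faster processors
algorithm_factor = 43_000    # from the quote: better LP algorithms

print(f"total speedup:         {total_speedup:,.0f}")
print(f"hardware x algorithms: {hardware_factor * algorithm_factor:,}")
```

Both routes land at roughly 43 million, so the two factors quoted multiply out to the headline number.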
The Electric Potential Sensors (EPS) are the first electrical sensors that can detect precisely the electrical activity of the heart without direct resistive contact with the body. The new sensors will make monitoring a patient's heartbeat, whilst they relax in their hospital bed or in their home, easier and less invasive than ever before. With commercial interest building quickly, the team of Sussex researchers believes the EPS will offer medical and home health professionals the opportunity to develop patient-friendly, self-administered systems to monitor their vital signs with the minimum impact on their mobility. The sensitivity of these sensors means they can also be used to detect muscle signals and eye movements and, in future, will be developed to detect brain and nerve-fibre signals. The EPS research group team, based in the University of Sussex's School of Engineering and Design, is led by Dr Robert Prance, Professor of Sensor Technology.
WASHINGTON — A system that detects the faint electric signals of beating human hearts is being used to help rescuers frantically seeking to locate people trapped under the rubble in China's horrific earthquake.
Eye tracking has long been known and used as a method to study the visual attention of individuals. There are several different techniques to detect and track the movements of the eyes. However, when it comes to remote, non-intrusive, eye tracking the most commonly used technique is Pupil Centre Corneal Reflection (PCCR). The basic concept is to use a light source to illuminate the eye causing highly visible reflections, and a camera to capture an image of the eye showing these reflections. The image captured by the camera is then used to identify the reflection of the light source on the cornea (glint) and in the pupil (See figure 4). We are then able to calculate a vector formed by the angle between the cornea and pupil reflections – the direction of this vector, combined with other geometrical features of the reflections, will then be used to calculate the gaze direction. The Tobii Eye Trackers are an improved version of the traditional PCCR remote eye tracking technology (US Patent US7,572,008). Near infrared illumination is used to create the reflection patterns on the cornea and pupil of the eye of a user and two image sensors are used to capture images of the eyes and the reflection patterns. Advanced image processing algorithms and a physiological 3D model of the eye are then used to estimate the position of the eye in space and the point of gaze with high accuracy.
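The core PCCR idea in the quote (take the vector between the corneal glint and the pupil centre, then map it to a gaze point) can be sketched in a few lines. This is a heavily simplified Python illustration, not Tobii's method: it assumes a fixed head, a per-axis linear calibration, and made-up pixel coordinates, and both function names are hypothetical. Real trackers use a 3D eye model as the quote says.

```python
def pupil_glint_vector(pupil_center, glint_center):
    """Vector from the corneal glint to the pupil centre, in image pixels."""
    return (pupil_center[0] - glint_center[0], pupil_center[1] - glint_center[1])

def fit_axis(vector_values, screen_values):
    """Least-squares line screen = a * v + b for one axis, from calibration samples."""
    n = len(vector_values)
    mean_v = sum(vector_values) / n
    mean_s = sum(screen_values) / n
    var_v = sum((v - mean_v) ** 2 for v in vector_values)
    cov = sum((v - mean_v) * (s - mean_s) for v, s in zip(vector_values, screen_values))
    a = cov / var_v
    return a, mean_s - a * mean_v

# Made-up calibration: the user looks at three known screen x positions.
glints = [(100, 80), (100, 80), (100, 80)]
pupils = [(90, 80), (100, 80), (110, 80)]
screen_x = [460, 960, 1460]

vx = [pupil_glint_vector(p, g)[0] for p, g in zip(pupils, glints)]
a, b = fit_axis(vx, screen_x)

# Estimate the horizontal gaze position for a new eye image.
new_vx = pupil_glint_vector((105, 80), (100, 80))[0]
print(a * new_vx + b)
```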

Note: The below isn't a remote sensor:
This study describes the functioning of a novel sensor to measure cortisol concentration in the interstitial fluid (ISF) of a human subject. ISF is extracted by means of vacuum pressure from micropores created on the stratum corneum layer of the skin. The pores are produced by focusing a near infrared laser on a layer of black dye material attached to the skin. The pores are viable for approximately three days after skin poration. Cortisol measurements are based on electrochemical impedance (EIS) technique. Gold microelectrode arrays functionalized with Dithiobis (succinimidyl propionate) self-assembled monolayer (SAM) have been used to fabricate an ultrasensitive, disposable, electrochemical cortisol immunosensor. The biosensor was successfully used for in-vitro measurement of cortisol in ISF. Tests in a laboratory setup show that the sensor exhibits a linear response to cortisol concentrations in the range 1 pM to 100 nM. A small pilot clinical study showed that in-vitro immunosensor readings, when compared with commercial evaluation using enzyme-linked immunoassay (ELISA) method, correlated well with cortisol levels in saliva and ISF. Further, circadian rhythm could be established between the subject's ISF and the saliva samples collected over a 24-hour time period. Cortisol levels in ISF were found reliably higher than in saliva. This research establishes the feasibility of using impedance based biosensor architecture for a disposable, wearable cortisol detector. The projected commercial in-vivo real-time cortisol sensor device, besides being minimally invasive, will allow continuous ISF harvesting and cortisol monitoring over 24 hours even when the subject is asleep. Forthcoming, this sensor could be interfaced to a wireless health monitoring system that could transfer sensor data over existing wide-area networks such as the internet and a cellular phone network to enable real-time remote monitoring of subjects.
Emotional or physical stresses induce a surge of adrenaline in the bloodstream under the command of the sympathetic nervous system, which cannot be suppressed by training. The onset of this elevated level of adrenaline triggers a number of physiological chain reactions in the body, such as dilation of the pupil and an increased feed of blood to muscles. This paper reports for the first time how Electro-Optics (EO) technologies such as hyperspectral [1,2] and thermal imaging [3] methods can be used for the detection of stress remotely. A preliminary result using the hyperspectral imaging technique has shown a positive identification of stress through an elevation of the haemoglobin oxygenation saturation level in the facial region, and the effect is seen more prominently for the physical stressor than the emotional one. However, all results presented so far in this work have been interpreted with respect to the baseline information as the reference point, and that has limited the overall usefulness of the developing technology. The present result has highlighted this drawback, and it prompts the need for a quantitative assessment of the oxygenation saturation, correlated directly with the stress level, as the first priority of the next stage of research.

Long range brainwave reading and entrainment is iffy.

Over time enhanced interrogation will have a stronger neuroscience focus:
A team of security researchers from Oxford, UC Berkeley, and the University of Geneva say that they were able to deduce digits of PIN numbers, birth months, areas of residence and other personal information by presenting 30 headset-wearing subjects with images of ATM machines, debit cards, maps, people, and random numbers in a series of experiments. The paper, titled “On the Feasibility of Side-Channel Attacks with Brain Computer Interfaces,” represents the first major attempt to uncover potential security risks in the use of the headsets.

“The correct answer was found by the first guess in 20% of the cases for the experiment with the PIN, the debit cards, people, and the ATM machine,” write the researchers. “The location was exactly guessed for 30% of users, month of birth for almost 60% and the bank based on the ATM machines for almost 30%.”

Re: Data Mining & Intelligence Agencies

Postby Wombaticus Rex » Wed Sep 26, 2012 5:45 pm

^^Thank you for a massive headfuck of a contribution to this thread, General.
Wombaticus Rex
Posts: 10385
Joined: Wed Nov 08, 2006 6:33 pm
Location: Vermontistan

Re: Data Mining & Intelligence Agencies

Postby General Patton » Wed Sep 26, 2012 7:02 pm

No problem.

Might be interested in this stuff too:

CIA created paper on intelligence analysis tradecraft: ... -apr09.pdf

Anon guide on fooling facial recognition scanners: ... f-the-law/

With the following comment: Whoever wrote that is a muppet. Clear plastic masks work much less than 100% of the time and will likely be useless in 2 years or less. None of the methods distort the camera's ability to match the distance between your eyes. Facial movements can distort some of the lips, nose and ears, but again that won't last long. If you're wearing LEDs, they probably aren't powerful enough to blind the camera. Higher powered (200mW+) multi-color lasers may be a better solution, as it will be harder to filter out the different spectrums of light. Again, the question is for how long. Gesture recognition technology is moving faster than I can keep up with.
Some of it is as safe as we think it can be, and some of it is not safe at all. The number one rule of “signals intelligence” is to look for plain text, or signaling information: who is talking to whom. For instance, you and I have been emailing, and that information, that metadata, isn’t encrypted, even if the contents of our messages are. This “social graph” information is worth more than the content. So, if you use SSL-encryption to talk to the OWS server for example, great, they don’t know what you’re saying. Maybe. Let’s assume the crypto is perfect. They see that you’re in a discussion on the site, they see that Bob is in a discussion, and they see that Emma is in a discussion. So what happens? They see an archive of the website, maybe they see that there were messages posted, and they see that the timing of the messages correlates to the time you were all browsing there. They don’t need to break the crypto to know what was said and who said it.

Traffic analysis. It’s as if they are sitting outside your house, watching you come and go, as well as the house of every activist you deal with. Except they’re doing it electronically. They watch you, they take notes, they infer information by the metadata of your life, which implies what it is that you’re doing. They can use it to figure out a cell of people, or a group of people, or whatever they call it in their parlance where activists become terrorists. And it’s through identification that they move into specific targeting, which is why it’s so important to keep this information safe first.
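The timing correlation described in the quote needs nothing but metadata. A toy Python sketch, with made-up users and times and a hypothetical `likely_authors` helper; it assumes the observer can see connection windows and the public post timestamps, and nothing about the encrypted contents:

```python
from datetime import datetime, timedelta

def likely_authors(post_time, sessions, slack=timedelta(seconds=30)):
    """Return users whose observed connection window covers the post time."""
    return sorted(
        user
        for user, windows in sessions.items()
        if any(start - slack <= post_time <= end + slack for start, end in windows)
    )

# Made-up metadata: who was connected to the site, and when.
t = datetime(2012, 9, 26, 12, 0)
sessions = {
    "you":  [(t, t + timedelta(minutes=10))],
    "bob":  [(t + timedelta(minutes=20), t + timedelta(minutes=30))],
    "emma": [(t + timedelta(minutes=5), t + timedelta(minutes=25))],
}

# Public timestamps of posts on the archived site.
posts = [t + timedelta(minutes=2), t + timedelta(minutes=22), t + timedelta(minutes=28)]

for post in posts:
    print(post.time(), likely_authors(post, sessions))
```

A single post can leave several candidates, but each additional post narrows the set, which is why the social graph leaks even when the crypto holds.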

For example, they see that we’re meeting. They know that I have really good operational security. I have no phone. I have no computer. It would be very hard to track me here unless they had me physically followed. But they can still get to me by way of you. They just have to own your phone, or steal your recorder on the way out. The key thing is that good operational security has to be integrated into all of our lives so that observation of what we’re doing is much harder. Of course it’s not perfect. They can still target us, for instance, by sending us an exploit in our email, or a link in a web browser that compromises each of our computers. But if they have to exploit us directly, that changes things a lot. For one, the NYPD is not going to be writing exploits. They might buy software to break into your computer, but if they make a mistake, we can catch them. But it’s impossible to catch them if they’re in a building somewhere reading our text messages as they flow by, as they go through the switching center, as they write them down. We want to raise the bar so much that they have to attack us directly, and then in theory the law protects us to some extent.

And the police can potentially push updates onto your phone that backdoor it and allow it to be turned into a microphone remotely, and do other stuff like that. The police can identify everybody at a protest by bringing in a device called an IMSI catcher. It’s a fake cell phone tower that can be built for 1500 bucks. And once nearby, everybody’s cell phones will automatically jump onto the tower, and if the phone’s unique identifier is exposed, all the police have to do is go to the phone company and ask for their information.
But iPhones, for instance, don’t have a removable battery; they power off via the power button. So if I wrote a backdoor for the iPhone, it would play an animation that looked just like a black screen. And then when you pressed the button to turn it back on it would pretend to boot. Just play two videos.

So the bigger question is communication networks. Simple stuff will be low cost expendable drones that stay operational for long periods and exchange messages. Right now batteries and renewables like solar aren't powerful enough to support something like that for most purposes, but forecast it out to 2014 and it becomes a lot more practical.

Private satellite launches are starting to boom as well. Expect a serious decline in cost over time, including from adding 3D printers on satellites to manufacture needed materials in orbit and save on launch costs (an experiment by Scott Summit has shown the objects made are structurally the same as when printed on Earth). So you end up with satellites, 10 years from now, that could be the size of a postage stamp and carry email and other basic services.

There is already an agency that works on the generation after next space technology: ... elease.pdf

Top 10 data mining mistakes
Avoid common pitfalls on the path to data mining success
Mining data to extract useful and enduring patterns is a skill arguably more art than science. Pressure enhances the appeal of early apparent results, but it’s too easy to fool yourself. How can you resist the siren songs of the data and maintain an analysis discipline that will lead to robust results? What follows are the most common mistakes made in data mining. Note: The list was originally a Top 10, but after compiling the list, one basic problem remained – mining without proper data. So, numbering like a computer scientist (with an overflow problem), here are mistakes Zero to 10.

ZERO Lack proper data.
To really make advances with an analysis, one must have labeled cases, such as an output variable, not just input variables. Even with an output variable, the most interesting type of observation is usually the most rare by orders of magnitude. The less probable the interesting events, the more data it takes to obtain enough to generalize a model to unseen cases. Some projects shouldn’t proceed until enough critical data is gathered to make them worthwhile.

ONE Focus on training.
Early machine learning work often sought to continue learning (refining and adding to the model) until achieving exact results on known data – which, at the least, insufficiently respects the incompleteness of our knowledge of a situation. Obsession with getting the most out of training cases focuses the model too much on the peculiarities of that data to the detriment of inducing general lessons that will apply to similar, but unseen, data. Try resampling, with multiple modeling experiments and different samples of the data, to illuminate the distribution of results. The mean of this distribution of evaluation results tends to be more accurate than a single experiment, and it also provides, in its standard deviation, a confidence measure.
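The resampling advice under mistake ONE can be sketched as repeated random splits, refitting each time and looking at the spread of held-out scores. A minimal Python version with toy stand-ins: the threshold "model", the synthetic data, and all function names are assumptions for illustration, not anything from the book.

```python
import random
import statistics

def resampled_scores(rows, fit, score, rounds=200, test_fraction=0.3, seed=0):
    """Refit on a fresh random split each round and collect held-out scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(score(fit(train), test))
    return scores

# Toy stand-ins: rows are (x, label); the "model" is a single threshold on x.
def fit_threshold(train):
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    return (statistics.mean(positives) + statistics.mean(negatives)) / 2

def accuracy(threshold, test):
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)

data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), 0) for _ in range(100)]
rows += [(data_rng.gauss(2, 1), 1) for _ in range(100)]

scores = resampled_scores(rows, fit_threshold, accuracy)
print(f"mean accuracy {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```

The mean is the estimate and the standard deviation is the rough confidence measure the text mentions.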

TWO Rely on one technique.
For many reasons, most researchers and practitioners focus too narrowly on one type of modeling technique. At the very least, be sure to compare any new and promising method against a stodgy conventional one. Using only one modeling method forces you to credit or blame it for the results, when most often the data is to blame. It’s unusual for the particular modeling technique to make more difference than the expertise of the practitioner or the inherent difficulty of the data. It’s best to employ a handful of good tools. Once the data becomes useful, running another familiar algorithm, and analyzing its results, adds only 5-10 percent more effort.

THREE Ask the wrong question.
It’s important first to have the right project goal or ask the right question of the data. It’s also essential to have an appropriate model goal. You want the computer to feel about the problem like you do – to share your multi-factor score function, just as stock grants give key employees a similar stake as owners in the fortunes of a company. Analysts and tool vendors, however, often use squared error as the criterion, rather than one tailored to the problem.

FOUR Listen (only) to the data.
Inducing models from data has the virtue of looking at the data afresh, not constrained by old hypotheses. However, don’t tune out received wisdom while letting the data speak. No modeling technology alone can correct for flaws in the data. It takes careful study of how the model works to understand its weakness. Experience has taught once brash analysts that those familiar with the domain are usually as vital to the solution as the technology brought to bear.

FIVE Accept leaks from the future.
Take this example of a bank’s neural network model developed to forecast interest rate changes. The model was 95 percent accurate – astonishing given the importance of such rates for much of the economy. Cautiously ecstatic, the bank sought a second opinion. It was found that a version of the output variable had accidentally been made a candidate input. Thus, the output could be thought of as only losing 5 percent of its information as it traversed the network. Data warehouses are built to hold the best information known to date; they are not naturally able to pull out what was known during the timeframe that you wish to study. So, when storing data for future mining, it’s important to date-stamp records and to archive the full collection at regular intervals. Otherwise, it will be very difficult to recreate realistic information states, leading to wrong conclusions.
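The date-stamping advice under mistake FIVE amounts to keeping every version of a value so you can ask "what was known on day X?" instead of "what is the best value now?". A small Python sketch; the `DatedHistory` class and the sample rates are hypothetical:

```python
from bisect import bisect_right
from datetime import date

class DatedHistory:
    """Keeps every date-stamped value so past information states can be rebuilt."""

    def __init__(self):
        self._dates = []
        self._values = []

    def record(self, stamp, value):
        # Insert while keeping the stamps sorted.
        i = bisect_right(self._dates, stamp)
        self._dates.insert(i, stamp)
        self._values.insert(i, value)

    def as_of(self, when):
        """Latest value known on or before `when`, or None if nothing was known yet."""
        i = bisect_right(self._dates, when)
        return self._values[i - 1] if i else None

rates = DatedHistory()
rates.record(date(2012, 1, 1), 3.25)
rates.record(date(2012, 6, 1), 3.50)

# A model trained on March data may only see what was known in March.
print(rates.as_of(date(2012, 3, 15)))
```

Building training inputs through a lookup like `as_of` keeps later values out of earlier rows.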

SIX Discount pesky cases.
Outliers and leverage points can greatly affect summary results and cloud general trends. Don’t dismiss them; they could be the result. When possible, visualize data to help decide whether outliers are mistakes or findings. The most exciting phrase in research is not the triumphal “Aha!” of discovery, but the puzzled uttering of “That’s odd.” To be surprised, one must have expectations. Make hypotheses of results before beginning experiments.

SEVEN Extrapolate.
We tend to learn too much from our first few experiences with a technique or problem. Our brains are desperate to simplify things. Confronted with conflicting data, early hypotheses are hard to dethrone - we’re naturally reluctant to unlearn things we’ve come to believe, even after an upstream error in our process is discovered. The antidote to retaining outdated stereotypes about our data is regular communication with colleagues about the work, to uncover and organize the unconscious hypotheses guiding our explorations.

EIGHT Answer every inquiry.
If only a model answered “Don’t know!” for situations in which its training has no standing! Take the following example of a model that estimated rocket thrust using engine temperature, T, as an input. Responding to a query where T = 98.6 degrees provides ridiculous results, as the input, in this case, is far outside the model’s training bounds. So, how do we know where the model is valid; that is, has enough data close to the query by which to make a useful decision? Start by noting whether the new point is outside the bounds, on any dimension, of the training data. But also pay attention to how far away the nearest known data points are.
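The two checks under mistake EIGHT (is the query outside the training bounds on any dimension, and how far is the nearest known point) can be wrapped around any model. A Python sketch using plain Euclidean distance on unscaled features, which is an assumption; the temperatures, the stand-in model, and the function names are made up:

```python
import math

def out_of_range_dims(query, training):
    """Indices of dimensions where `query` lies outside the range seen in training."""
    return [
        d for d, q in enumerate(query)
        if not min(row[d] for row in training) <= q <= max(row[d] for row in training)
    ]

def nearest_distance(query, training):
    """Euclidean distance from `query` to the closest training point."""
    return min(math.dist(query, row) for row in training)

def answer_or_abstain(query, training, model, max_distance):
    """Return the model's answer, or None as the model's "Don't know!"."""
    if out_of_range_dims(query, training):
        return None
    if nearest_distance(query, training) > max_distance:
        return None
    return model(query)

training = [(2500.0,), (2800.0,), (3100.0,), (3400.0,)]  # made-up engine temperatures
thrust_model = lambda q: 0.5 * q[0]                       # stand-in, not a real thrust model

print(answer_or_abstain((98.6,), training, thrust_model, max_distance=200.0))   # None
print(answer_or_abstain((3000.0,), training, thrust_model, max_distance=200.0))
```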

NINE Sample casually.
The interesting cases for many data mining problems are rare and the analytic challenge is akin to finding needles in a haystack. However, many algorithms don’t perform well in practice, if the ratio of hay to needles is greater than about 10 to 1. To obtain a near-enough balance, one must either down-sample to remove most common cases or up-sample to duplicate rare cases. Yet it is a mistake to do either casually. A good strategy is to “shake before baking”; that is, to randomize the order of a file before sampling. Split data into sets first, then up-sample rare cases in training only. A stratified sample will often save you trouble. Always consider which variables need to be represented in each data subset and sample separately.
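The order of operations under mistake NINE (shuffle, split, then up-sample rare cases in training only) is easy to get wrong, so here is a Python sketch. The function name, the 30% test fraction, and the toy rows are assumptions for illustration; it does a plain random split, not the stratified one the text also recommends.

```python
import random

def split_then_upsample(rows, is_rare, test_fraction=0.3, seed=0):
    """Shuffle, split, then duplicate rare cases in the training set only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)                      # "shake before baking"
    cut = int(len(shuffled) * test_fraction)
    test, train = shuffled[:cut], shuffled[cut:]

    rare = [r for r in train if is_rare(r)]
    common = [r for r in train if not is_rare(r)]
    if rare and len(rare) < len(common):
        # Draw rare cases with replacement until the two groups are the same size.
        rare = rare + rng.choices(rare, k=len(common) - len(rare))

    balanced = common + rare
    rng.shuffle(balanced)
    return balanced, test

# Toy rows: (id, label), with 50 "needles" in 1,000 cases.
rows = [(i, 1 if i < 50 else 0) for i in range(1000)]
train, test = split_then_upsample(rows, is_rare=lambda r: r[1] == 1)

print(len(train), sum(label for _, label in train), len(test))
```

Because duplication happens after the split, no copy of a test case can end up in the training set.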

TEN Believe the best model.
Don’t read too much into models; it may do more harm than good. Too much attention can be paid to particular variables used by the best data mining model – which likely barely won out over hundreds of others of the millions (to billions) tried – using a score function only approximating your goals, and on finite data scarcely representing the underlying data-generating mechanism. Better to build several models and interpret the resulting distribution of variables, rather than the set chosen by the single best model.

How will we succeed?
Modern tools, and harder analytic challenges, mean we can now shoot ourselves in the foot with greater accuracy and more power than ever before. Success is improved by learning from experience; especially our mistakes. So go out and make mistakes early! Then do well, while doing good, with these powerful analytical tools.

*From the book, Handbook of Statistical Analysis & Data Mining Applications by Bob Nisbet, John Elder and Gary Miner. Copyright 2009. Published by arrangement with John Elder.

Re: Data Mining & Intelligence Agencies

Postby General Patton » Thu Sep 27, 2012 9:46 pm

Good overview of government simulation programs: ... lation.htm

A lot of it is written in Java. A corporate programming language if there ever was one.
In the last three years, America’s military and intelligence agencies have spent more than $125 million on computer models that are supposed to forecast political unrest. It’s the latest episode in Washington’s four-decade dalliance with future-spotting programs. But if any of these algorithms saw the upheaval in Egypt coming, the spooks and the generals are keeping the predictions very quiet.

Instead, the head of the CIA is getting hauled in front of Congress, making calls about Egypt’s future based on what he read in the press, and getting proven wrong hours later. Meanwhile, an array of Pentagon-backed social scientists, software engineers and computer modelers are working to assemble forecasting tools that are able to reliably pick up on geopolitical trends worldwide. It remains a distant goal.

“All of our models are bad, some are less bad than others,” says Mark Abdollahian, a political scientist and executive at Sentia Group, which has built dozens of predictive models for government agencies.

“We do better than human estimates, but not by much,” Abdollahian adds. “But think of this like Las Vegas. In blackjack, if you can do four percent better than the average, you’re making real money.”

Over the past three years, the Office of the Secretary of Defense has handed out $90 million to more than 50 research labs to assemble some basic tools, theories and processes that might one day produce a more predictable prediction system. None are expected to result in the digital equivalent of crystal balls any time soon.

In the near term, Pentagon insiders say, the most promising forecasting effort comes out of Lockheed Martin’s Advanced Technology Laboratories in Cherry Hill, New Jersey. And even the results from this Darpa-funded Integrated Crisis Early Warning System (ICEWS) have been imperfect, at best. ICEWS modelers were able to forecast four of 16 rebellions, political upheavals and incidents of ethnic violence to the quarter in which they occurred. Nine of the 16 events were predicted within the year, according to a 2010 journal article [.pdf] from Sean O’Brien, ICEWS’ program manager at Darpa.

Darpa spent $38 million on the program, and is now working with Lockheed and the United States Pacific Command to make the model a more permanent component of the military’s planning process. There are no plans, at the moment, to use ICEWS for forecasting in the Middle East.
A second approach is to look at the big social, economic and demographic forces at work in a region — the average age, the degree of political freedom, the gross domestic product per capita — and predict accordingly. This “macro-structural” approach can be helpful in figuring out long-term trends, and forecasting general levels of instability; O’Brien relied on it heavily, when he worked for the Army. For spotting specific events, however, it’s not enough.

The third method is to read the news. Or rather, to have algorithms read it. There are plenty of programs now in place that can parse media reports, tease out who is doing what to whom, and then put it all into a database. Grab enough of this so-called “event data” about the past and present, the modelers say, and you can make calls about the future. Essentially, that’s the promise of Recorded Future, the web-scouring startup backed by the investment arms of Google and the CIA.

But, of course, news reports are notoriously spotty, especially from a conflict zone. It’s one of the reasons why physicist Sean Gourley’s much heralded, tidy-looking equation to explain the chaos of war failed to impress in military circles. Relying on media accounts, it was unable to forecast the outcome of the 2007 military surge in Iraq.

ICEWS is an attempt to combine all three approaches, and ground predictions in social science theory, not just best guesses. In a preliminary test, the program was fed event data about Pacific nations from 2004 and 2005. Then the software was asked to predict when and where insurrections, international crises and domestic unrest would occur. Correctly calling nine of 16 events within the year they happened was considered hot stuff in the modeling world.

Palantir chose a different path, essentially working on reducing interface friction between computers and humans. The company itself may serve as the bridge to an AGI government assistant; Thiel is heavily involved in AGI research through Vicarious. It will also be interesting to see if 3D goggles like the Oculus Rift will help reduce friction.

In 2005, the online chess-playing site hosted what it called a “freestyle” chess tournament in which anyone could compete in teams with other players or computers. Normally, “anti-cheating” algorithms are employed by online sites to prevent, or at least discourage, players from cheating with computer assistance. (I wonder if these detection algorithms, which employ diagnostic analysis of moves and calculate probabilities, are any less “intelligent” than the playing programs they detect.)

Lured by the substantial prize money, several groups of strong grandmasters working with several computers at the same time entered the competition. At first, the results seemed predictable. The teams of human plus machine dominated even the strongest computers. The chess machine Hydra, which is a chess-specific supercomputer like Deep Blue, was no match for a strong human player using a relatively weak laptop. Human strategic guidance combined with the tactical acuity of a computer was overwhelming.

The surprise came at the conclusion of the event. The winner was revealed to be not a grandmaster with a state-of-the-art PC but a pair of amateur American chess players using three computers at the same time. Their skill at manipulating and “coaching” their computers to look very deeply into positions effectively counteracted the superior chess understanding of their grandmaster opponents and the greater computational power of other participants. Weak human + machine + better process was superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process.
One approximation might be a simple product, that is, a linear amplification. Let’s imagine a function, a(h,c), in which the analytic power (a) is the product of the power of the human (h) and the computing power of the chess engine being used (c). This gives us the equation:

a(h,c) = h × c

H – this is the power of the analyst. In chess, the value of this term varies widely between players; in designing real-world data analysis systems, this is more or less a constant (which is why h above becomes H below). Of course there are differing levels of expertise, training, and raw ability amongst the user population, but when we design systems, it’s with the average case in mind.
c – computing power. How fast are the machines? How well do they scale? How efficiently do they perform the data tasks at hand? Palantir spends significant engineering effort on optimizing the c term, but most of the growth in this term comes from the layers we depend on, built by companies like Intel, Sun, Oracle, etc.
f – friction. How easy is it to bring c to bear on the problem? Note that when we talk about friction of interface, this is not exclusively referring to user interface. More generally, friction can be present at any interface between two systems: data-software, software-software, human-software, etc. The f that we consider in this simple model is sum total system friction.
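
The refined equation that uses H, c and f is not reproduced in the text above. As a rough illustration only, the sketch below assumes analytic power grows with H times c and shrinks as friction grows, i.e. a = H * c / f. That functional form, the function name and all the numbers are assumptions made for the example, not the formula from the original article.

```python
def analytic_power(H: float, c: float, f: float) -> float:
    """Hypothetical model: analyst power times computing power, divided by friction."""
    if f <= 0:
        raise ValueError("friction must be positive")
    return H * c / f


# Made-up numbers loosely echoing the freestyle chess story:
# same human ability, but one team has more hardware and a clumsier process.
strong_machine_high_friction = analytic_power(H=1.0, c=10.0, f=5.0)
weaker_machine_low_friction = analytic_power(H=1.0, c=6.0, f=2.0)

print(strong_machine_high_friction, weaker_machine_low_friction)
```

Under this assumed form, halving f buys as much as doubling c, which is one way to read the claim that a better process can beat better hardware.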


When we discuss friction in real-world analysis systems, the friction actually exists at multiple levels:

Creating an analysis model that will enable answering the questions that need to be explored
Integrating the data into a single coherent view of the problem
Enabling analysis tools to efficiently query and load the data
Exposing APIs that allow developers to develop custom solutions quickly and efficiently for modeling and analysis tasks not covered by general tools
User interface that makes the tools easy, enjoyable, and quick to use

Another cool thing: Karto, an open source 2d SLAM (mapping system) along with other sweet stuff, used for exploring caves among other things:

This sort of thing could be used for HUDs, particularly if we improve designs of quadrotors so we can use them indoors. This is another area where computer vision, including gesture recognition, is going to shake things up. The concept of data mining is extending as technology develops more sensors to take in data. ... e-body.htm
A new study, published in the journal Science, details how scientists have created tiny, fully functional electronic devices capable of vanishing within their environment, such as in the body or in water, once they are no longer needed or useful. There are already implants that dispense drugs or provide electrical stimulation but they do not dissolve.

The latest creation is an early step in a technology that may benefit not only medicine, for example by enabling medical implants that don't need to be surgically removed and that avoid the risk of long-term side effects, but also electronic waste disposal.
Researchers led by John Rogers, a materials scientist at the University of Illinois at Urbana-Champaign, Fiorenzo Omenetto, a biomedical engineer at Tufts University in Medford, Massachusetts, and Yonggang Huang of Northwestern University have already designed an imaging system that monitors tissue from inside a mouse, a thermal patch that prevents infection after a surgical site is stitched up, solar cells, and strain and temperature sensors.

Abstract: ... 0.abstract
Penal battalion, forward
General Patton
Posts: 958
Joined: Thu Nov 16, 2006 11:57 am

Re: Data Mining & Intelligence Agencies

Postby Wombaticus Rex » Sat Nov 10, 2012 11:08 am

Good piece here: ... ytics.html

Fantasy Analytics

Sometimes it just amazes me what people think is computable given their actual observation space. At times you have to look them in the eye and tell them they are living in fantasyland.

Here is an example conversation:

Me: “Tell me about your company.”

Customer: “We are in the business of moving things through supply chains.”

Me: “What do you want to achieve with analytics?”

Customer: “We want to find bombs in the supply chain.”

Me: “COOL!”

Me: “Tell me about your available observation space.”

Customer: “We have information on the shipper and receiver.”

“We also know the owner of the plane, train, truck, car, etc.”

“And the people who operate these vehicles too.”

Me: “Nice. What else do you have?”

Customer: “We have the manifest – a statement about the contents.”

Me: “Excellent. What else you got?”

Customer: “That’s it.”

Me: “WHAT?!”



The problem is that the business objectives (e.g., finding a bomb) are often simply not possible given the proposed observation space (data sources).

(Unless, in this case, the perpetrator writes the word “BOMB” on the manifest. And only idiots do that. And luckily we don’t have to worry much about the idiots as they run out of gas on the way to the operation and take wet matches to their fuses.)

When we software engineering folks get overly excited and run off and build systems with little forethought about the balance between the mission objectives and the observation space, there is a risk the system will be a useless piece of crud – so many false positives and false negatives – the value of the system not worth the cost.
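
To put rough numbers on the false-positive problem, here is a small base-rate sketch. The rates and shipment counts are invented for illustration; nothing here comes from a real screening system.

```python
def alert_quality(base_rate, sensitivity, false_positive_rate, n_items):
    """Expected true alerts, false alerts and precision for a detector.

    base_rate: fraction of items that are real threats
    sensitivity: fraction of real threats the detector flags
    false_positive_rate: fraction of harmless items the detector flags
    """
    positives = n_items * base_rate
    negatives = n_items - positives
    true_alerts = positives * sensitivity
    false_alerts = negatives * false_positive_rate
    total_alerts = true_alerts + false_alerts
    precision = true_alerts / total_alerts if total_alerts else 0.0
    return true_alerts, false_alerts, precision


# Hypothetical: one real threat per million shipments, a detector that
# catches 99% of them and wrongly flags 1% of harmless shipments.
true_alerts, false_alerts, precision = alert_quality(
    base_rate=1e-6, sensitivity=0.99, false_positive_rate=0.01, n_items=10_000_000
)
print(f"true alerts: {true_alerts:.1f}")
print(f"false alerts: {false_alerts:.1f}")
print(f"precision: {precision:.6f}")
```

With these made-up inputs the detector raises on the order of 100,000 false alarms for about 10 real hits, which is the "not worth the cost" situation described above.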

As I have no interest in spending intense chunks of my life building pointless systems … when initially scoping a project I recommend first qualifying the available observation space to determine if it is sufficient to deliver on the mission objectives. And if the available observation space is insufficient, then one must first figure out if/how the observation space can be appropriately widened.

In case you are interested, here are some of the ways I try to mitigate these risks:

Qualifying Observation Spaces

1. Ask for real examples from the past of things they would like to detect (opportunity or risk), and then look in the real data to see if, upon human inspection, it is discoverable.
2. If real examples from the past cannot be detected in the provided data sources, I tell them “not even a sentient being could discover this.”
3. Have them name their data sources and the data elements (key features).
4. Then, just because they say a data source has certain features, go look yourself – I can’t tell you how many times I go take a look and find key columns are empty or so dirty that the value of that data source is negligible.
5. If the data sources share common features between them (e.g., customer number, address, phone number, etc.) then generally more is good.
6. For those data sources that have no (or few useful) shared features (e.g., one data source has name and address and the other data source has stock symbol and stock price) then generally this is not so good.
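
Steps 4 to 6 above can be sketched in a few lines: check how full each column really is, then see which features two sources have in common. The sample records, column names and helper names are all hypothetical.

```python
def is_filled(value):
    """Treat None and blank strings as empty; keep zeros and other values."""
    return value is not None and str(value).strip() != ""


def fill_rate(rows, column):
    """Fraction of rows where the column holds a usable value."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if is_filled(row.get(column))) / len(rows)


def shared_features(source_a, source_b):
    """Column names present in both sources (candidate join keys)."""
    cols_a = set().union(*(row.keys() for row in source_a))
    cols_b = set().union(*(row.keys() for row in source_b))
    return cols_a & cols_b


# Hypothetical sources
shippers = [
    {"customer_id": "C1", "phone": "555-0100", "address": ""},
    {"customer_id": "C2", "phone": None, "address": " "},
    {"customer_id": "C3", "phone": "", "address": "12 Main St"},
]
manifests = [{"customer_id": "C1", "contents": "machine parts"}]
quotes = [{"symbol": "XYZ", "price": 10.5}]

for column in ("customer_id", "phone", "address"):
    print(column, round(fill_rate(shippers, column), 2))
print(shared_features(shippers, manifests))  # something to join on
print(shared_features(shippers, quotes))     # nothing in common
```

A column that is declared in the schema but mostly empty in practice shows up immediately as a low fill rate, which is the "go look yourself" step.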

Widening Observation Spaces

1. There will be many cases where it becomes necessary to help the customer think about widening their observation space … if they are ever to realize their hopes and dreams (business objectives).
2. Conjuring up additional data to expand the observation space is quite an art and requires real-world understanding of what and how data flows inside the walls and outside the walls as well as the legal and policy ramifications.
3. Generally one starts looking for new data sources in this order: 1) other stuff inside the walls that you already collect (e.g., product returns), 2) external data that can be purchased (e.g., marketing flags like “presence of children” and “income indicators” as routinely sold by data aggregators). Of course there are other options like collecting more data themselves (e.g., adding a field to a web page so their customers can express sentiment, capturing the fingerprint of the device during on-line transactions, etc.)
4. If you are trying to catch bad guys, hope that some of the data sources would be unknown or non-intuitive to their adversary (if the bad guys know you have cameras on these four streets, then they will take the fifth street).
5. Beware of social media: There is much allure to the idea that one can computationally map Twitter statements (about your company/brand) to which customer said it. Go take a look yourself and see how often the Tweet account contains sufficient features to make an accurate identity assertion. I think you will see the frequency in which an identity can be asserted is underwhelming. Different countries and different kinds of social sites will have different statistics. In any case, be wary and look for yourself first.
6. Now let’s say one has a list of potentially new data sources to use. Then the next question is how to prioritize all these possibilities. Again, there are a lot of ways to think about this – but here are a few common ways I think about this: A) Data that improves the ability to count or relate entities (e.g., a source that may contain new identifiers like email addresses) so that one can discover that two customers you thought were different are more likely the same customer; B) Data that brings more facts (e.g., what, where, when, how many, how much); C) Diverse data potentially containing identifiers and facts in disagreement (e.g., this fact indicates they are here, but that fact shows they really may be over there – helpful if trying to keep strangers from using your credit card).
7. Finally, don’t forget there will be plenty of times that the mission objectives cannot be achieved because the necessary observation space is not available. In which case, punt.
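
Point 6A in the list above (a new source whose identifiers reveal that two customers are really one) can be sketched as a toy record-linking routine. This is an illustration under simple assumptions: exact matching on normalized email or phone, hypothetical field names, and invented records. Real entity resolution is considerably messier.

```python
from collections import defaultdict


def link_records(records, fields=("email", "phone")):
    """Group record ids that share at least one identifier value (union-find)."""
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    seen = {}  # (field, normalized value) -> first record id carrying it
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in seen:
                parent[find(rec["id"])] = find(seen[key])
            else:
                seen[key] = rec["id"]

    groups = defaultdict(set)
    for rec in records:
        groups[find(rec["id"])].add(rec["id"])
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records: A and B look like different customers until a
# new source (record C) supplies both the phone and the email.
existing = [
    {"id": "A", "name": "J. Smith", "phone": "555-0100"},
    {"id": "B", "name": "John Smith", "email": "js@example.com"},
    {"id": "D", "name": "Ann Lee", "email": "ann@example.com"},
]
new_source = [
    {"id": "C", "name": "Jon Smith", "phone": "555-0100", "email": "JS@example.com"},
]

print(link_records(existing))
print(link_records(existing + new_source))
```

Before the new source arrives every record stands alone; afterwards A, B and C collapse into one entity, which is the kind of gain that would move a data source up the priority list.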

The above list is somewhat off the cuff, certainly incomplete. So … please consider it a starter kit and hack at it any which way you like …
Wombaticus Rex
Posts: 10385
Joined: Wed Nov 08, 2006 6:33 pm
Location: Vermontistan

Re: Data Mining & Intelligence Agencies

Postby Wombaticus Rex » Tue Dec 25, 2012 4:40 pm

The GOP Talent Gap

The 2012 election should be a wake-up call for those who raise and spend money for the Republican Party.

All too often, how we run campaigns has been untethered from scientific rigor, and without any real-time certainty whether something is working or not. Aggressiveness is praised, and hard-hitting TV ads have come to be seen as the sine qua non of an aggressive campaign. Thanks to this worldview, billions were poured into presidential and down-ballot television advertisements out of a conviction that these ads would move numbers.

Donors, for their part, were continually pressed to double down on more ads. Tweeted Republican media strategist Rick Wilson, quoting a conversation with a mega-donor, "Every *ing conf call, 'We're good but we need 1000 more GRP in X one, not even me, drilling in enough.'"

But if all these ads had the desired effect, it was not always apparent in the election returns. That should be obvious to Republicans from the fact that Obama won reelection, while the Democrats picked up two U.S. Senate seats. But it's also apparent at the county level in some cases, and in unexpected ways.

Take the case of Lucas County, Ohio. For all the talk of the pummeling Romney received in paid media over the auto bailout, Obama fared no better than he did in other urban centers statewide in Lucas County, home of auto-centric Toledo (the town Jeep was supposed to be shipping jobs from, according to a Romney ad). Nor was Lucas County among the 18 Ohio counties -- mostly in the south of the state -- where the president did better than he did in 2008.

While television ads still play an important role, particularly downballot, the election results clearly show that Republican campaigns need to be just as aggressive with their grassroots outreach, online persuasion, and data collection and analysis as their media buys.

After the 2008 election, Obama's campaign manager David Plouffe outlined a key shift in how the campaign had set priorities for itself. The campaign spent its first dollars fully funding grassroots organizers in swing states, and then funded TV out of what was left over. A groundbreaking digital operation ensured that the campaign had ample resources to do both. The Obama re-election campaign repeated the strategy.

Given how Obama's ground game helped him outperform the final polling margins in key swing states this year, such as Florida and Colorado, the fact that the Republican campaign class has failed to adapt is striking.

How might future Republican campaigns and outside groups spend money differently?

A disproportionate amount of postmortem coverage has focused on Obama's data and technology operation, which was bigger -- though also qualitatively different -- than in 2008. Instead of relying on the magic of a youthful candidate, big rallies, and racking up a billion minutes of view time on YouTube, Obama 2012 used quantitative analysis to squeeze out every last advantage it could, reflecting the "grind it out" mentality of this year's campaign.

Given the attention, it would be only natural for GOP donors and operatives looking for ways to win in 2014 and 2016 to fixate on replicating the Big Data campaign and seek out data scientists, behavioral economists, and silver-bullet technologies in an effort to catch up.

They'll need to: It is true that the Democrats are ahead in the race to master the science of winning elections. But technology isn't everything. And if Republicans take the wrong lessons from this defeat, they could find themselves in an even bigger hole four years from now.

Recruit the Best and the Brightest, From Everywhere

The most pressing and alarming deficit Republican campaigns face is in human capital, not technology. From recruiting Facebook co-founder Chris Hughes in 2008, to Threadless CTO Harper Reed in 2012, Democrats have imported the geek culture of Silicon Valley's top engineers into their campaigns. This has paid significant dividends for two election cycles running.

Technology is ideologically neutral and can be built or appropriated by either party. A campaign workforce well versed in the skills needed to win the modern campaign is much harder to replicate than a program. Creative thinking is a necessity.

While there are many brilliant minds in the upper echelons of the Republican data and technology world -- including those who built the first national voter file -- the bench is not very deep. The Republican campaign world by and large does not demand technologically deep solutions, or much more than a glorified WordPress blog for campaign websites. Thus, the market largely does not supply them.

Facing a shortage of tech talent, campaigns largely treat digital as an entry-level job or draft talent from their field operation. While these young operatives thrive in the rapid-fire world of social media, longer-term data and technology projects can fall by the wayside. Thus, when it comes time to build intensive applications like the ill-fated Orca, Republican campaigns must go far afield to non-political vendors who don't understand the special kind of hell that is Election Day.

Whether recruited from Silicon Valley or not, it is clear that Chicago assembled a team of the highest caliber developers, designers, number-crunchers and user experience people in the country. It was technology masterfully executed, with a human touch.

Does this mean that this Democratic advantage is permanent? It will be if the operatives and funders in the Republican Party come to believe that the problem is merely one of buying a few shiny tech objects, rather than doing the hard work of recruiting a new generation of technical and data talents to remake the culture of Republican campaigns.

Today, Republican campaigns think they have the data box checked when they buy a voter file or do rudimentary website "cookie" targeting.

This is data dumbed-down, and it's not how the Obama campaign won.

What really matters is extracting insight from the sea of data to determine what is actually happening on the ground. Leveraging it requires a specialized set of skills you don't normally find in politics, where you typically focus on maximizing the raw number of voter contacts. Every campaign needs analysts whose job it is to run regression analyses on that night's wave of voter ID calls to find hidden patterns in the data, and to continually optimize based on performance.

The tools to do this, at least in the world of data analysis, are mature and ready to use by both parties. A story about how Target was able to look at purchase history to infer that a teenage girl was pregnant before her own father knew achieved mythical status after it was published by New York Times reporter Charles Duhigg in his book The Power of Habit. The data scientist who came up with the algorithm lived in a sleepy (and likely deep red) Minneapolis suburb. Data modeling isn't the exclusive province of liberal precincts in Palo Alto; it has been a staple of corporate America (and on Wall Street) for years, driving billions in profits.

Republicans must look for new blood to fill this talent gap everywhere -- from the libertarian-minded minority in Silicon Valley to corporate America. Republican donors should only invest in projects that focus on talent first. For tech startups that go big, it's usually not because of the technology, but because of the team.

Don't Copy -- Jump to the Next Curve

In 2004, email and websites were new and by 2008, they were mature. In 2008, social media was new and by 2012, it was mature. In 2012, mobile and data science were the two emerging trends in political campaigns. These are likely to be mature technologies that operate at a significantly larger scale in 2016.

Technology entrepreneurs are constantly fighting to "jump to the next curve," avoiding obsolescence to ride the wave of a rising technology from newness to maturity. We can partially predict what the mature technologies of 2016 will be by looking at what the new ones were this year.

Just as venture capitalists would reject pitches from companies aiming to become the next Facebook or Google, as their business models seem fairly secure, Republican donors should apply a similar framework to evaluating technology projects. Is the project trying to solve a problem which has already been solved, or whose relevance is on the decline? If so, they shouldn't invest.

Sometimes, problems persist for a long time until a shift in the technology makes a solution possible. Success with mobile donations has long eluded campaigns, until the Obama campaign's "Quick Donate" which stored supporters' credit card information and allowed them, in the words of supporters I've tweeted with, to "drunk donate" with a single click.

Test Everything

Campaign operatives are not all-knowing seers, and even the best have weathered their share of defeats. The industry needs a dose of modesty, and every claim about what works and what doesn't needs to be subjected to rigorous testing and evaluation, especially in a world where media budgets have not caught up to the fact that an increasing amount of content is consumed online.

Operatives -- even digital ones -- need to be willing to subject everything they do to randomized experiments to show what's working and what isn't. (If we did, we'd likely never hear another GOTV robocall again.) The top performers may not be the ones who hit on the first try -- but those who can tweak and adjust to achieve the best result over the long run. In a political industry known for its brashness, the key to success moving forward may be an introspective and intellectually curious mindset that comes with constant testing and refinement. This year showed that false certainty in the tried-and-true tactics of yesteryear cannot be an option in 2016.
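
For anyone wondering what "randomized experiments" means in practice, here is a minimal sketch: split a voter list at random, apply the tactic to one half, then compare turnout with a two-proportion z-test. The function names and turnout counts are invented for illustration, and a real campaign test would need more care (blocking, multiple comparisons and so on).

```python
import math
import random


def assign(voter_ids, seed=0):
    """Randomly split voters into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(voter_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic and two-sided p-value for a difference in proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


treatment, control = assign(range(10_000))

# Hypothetical outcome: 2,150 of 5,000 contacted voters turned out,
# versus 2,000 of 5,000 in the control group.
z, p_value = two_proportion_z(2150, len(treatment), 2000, len(control))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The random assignment is the important part: it is what lets the turnout gap be credited to the tactic rather than to whoever the campaign happened to contact.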
Wombaticus Rex
Posts: 10385
Joined: Wed Nov 08, 2006 6:33 pm
Location: Vermontistan

Re: Data Mining & Intelligence Agencies

Postby JackRiddler » Mon Mar 11, 2013 8:41 pm

Phew, thanks for that article WR. Someone I know was interested in it.

Here's the original link by the way: ... ap/265333/
We meet at the borders of our being, we dream something of each others reality. - Harvey of R.I.

To Justice my maker from on high did incline:
I am by virtue of its might divine,
The highest Wisdom and the first Love.

TopSecret WallSt. Iraq & more
Posts: 14654
Joined: Wed Jan 02, 2008 2:59 pm
Location: New York City

Re: Data Mining & Intelligence Agencies

Postby General Patton » Fri Mar 22, 2013 1:54 pm

... 3/19a.aspx
Machine learning – the ability of computers to understand data, manage results, and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Even a team of specially-trained machine learning experts makes only painfully slow progress due to the lack of tools to build these systems.

The Probabilistic Programming for Advanced Machine Learning (PPAML) program was launched to address this challenge. Probabilistic programming is a new programming paradigm for managing uncertain information. By incorporating it into machine learning, PPAML seeks to greatly increase the number of people who can successfully build machine learning applications and make machine learning experts radically more effective. Moreover, the program seeks to create more economical, robust and powerful applications that need less data to produce more accurate results – features inconceivable with today’s technology.

“We want to do for machine learning what the advent of high-level program languages 50 years ago did for the software development community as a whole,” said Kathleen Fisher, DARPA program manager.

This is the general thrust of computation: decrease interface friction. Make syntax less important, integrate it with the abstract world.
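
To give a feel for the probabilistic programming idea, here is a toy sketch in plain Python: the model is written as an ordinary generative function, and a generic inference routine conditions it on observed data. The coin example, the function names and the use of rejection sampling are my own assumptions for illustration; this is not how PPAML tools are built.

```python
import random


def model(rng):
    """Generative story: an unknown coin bias, then ten flips of that coin."""
    bias = rng.random()  # uniform prior on the bias
    heads = sum(rng.random() < bias for _ in range(10))
    return bias, heads


def infer(model, observed_heads, n_samples=20000, seed=0):
    """Generic rejection sampler: keep latent values whose simulated data match."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        bias, heads = model(rng)
        if heads == observed_heads:
            accepted.append(bias)
    return accepted


posterior = infer(model, observed_heads=8)
posterior_mean = sum(posterior) / len(posterior)
print(f"kept {len(posterior)} samples, posterior mean bias ~ {posterior_mean:.2f}")
```

The point is the separation: `model` only describes how the data could have arisen, and `infer` knows nothing about coins. For a uniform prior and 8 heads in 10 flips the exact posterior mean is 9/12 = 0.75, so the sampled mean should land near that value.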
Penal battalion, forward
General Patton
Posts: 958
Joined: Thu Nov 16, 2006 11:57 am

Re: Data Mining & Intelligence Agencies

Postby coffin_dodger » Fri Mar 22, 2013 5:47 pm

CIA's Gus Hunt On Big Data: We 'Try To Collect Everything And Hang On To It Forever'

NEW YORK -- The CIA's chief technology officer outlined the agency's endless appetite for data in a far-ranging speech on Wednesday.

Speaking before a crowd of tech geeks at GigaOM's Structure:Data conference in New York City, CTO Ira "Gus" Hunt said that the world is increasingly awash in information from text messages, tweets, and videos -- and that the agency wants all of it. ... 917842.htm
Posts: 2204
Joined: Thu Jun 09, 2011 6:05 am
Location: UK

Re: Data Mining & Intelligence Agencies

Postby coffin_dodger » Fri Jun 07, 2013 1:49 pm

^the above huffpo link is dead - but it's still on youtube

Posts: 2204
Joined: Thu Jun 09, 2011 6:05 am
Location: UK

