
Changing data (attributes) in .csv of data vector layer and QGIS automatically adopting this?



I have a .csv with data for 300 points: each column in the .csv is a particular object, and the cells define the amount that is available at each point.

If I need to change this amount, is it possible to edit the .csv file so that QGIS automatically picks up these edits in its attribute table?


Yes, if a .csv file is loaded into QGIS and you update the .csv file, the attributes of the layer will also reflect the changes:

A .csv file loaded into QGIS:

Editing the .csv file with a simple text editor (I used Notepad):

Save the edits and then load the attribute table again in QGIS to see the updates:
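If you prefer not to reopen the attribute table by hand, a short PyQGIS snippet run from the QGIS Python console can force the refresh; the layer name used here is only a placeholder:

```python
# Force the delimited-text layer to re-read its source .csv after it has been
# edited and saved in an external editor. "points_csv" is a placeholder name.
layer = QgsProject.instance().mapLayersByName("points_csv")[0]
layer.reload()           # re-read the underlying .csv file
layer.triggerRepaint()   # redraw the canvas so the change is visible
```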

Hope this helps!


I take it you expect to see the following:

And you see the following instead:

You're saving the file as UTF-8, but the program reading the file is decoding it using cp1252. These two have to match!

  1. Encode the text using cp1252 (:encoding(cp1252)) if the reader is going to continue decoding it using cp1252.
  2. Have the reader decode the file using UTF-8 (:encoding(UTF-8)) if you're going to encode it as UTF-8.

Generally speaking, the latter is the better option as it allows the file to contain any Unicode character rather than an abysmally small subset.

There is a program called iconv on most Unix systems that can re-encode files from one encoding to another. You first need to determine the original encoding of your file.

This would translate a file written in Windows using the default Code Page 1252 into UTF-8. I would first try cp1252 and see if that works. If not, try cp1250, latin1, and macintosh (the file could have been created with MacRoman).
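If iconv is not available (for example on Windows), a small Python script can perform the same re-encoding; the file names and the cp1252 guess are assumptions:

```python
# Re-encode a text file from Windows cp1252 to UTF-8 (roughly what
# `iconv -f CP1252 -t UTF-8` does). Try cp1250, latin1 or mac_roman as the
# source encoding if cp1252 produces garbage.
with open("input.txt", "r", encoding="cp1252") as src:
    text = src.read()
with open("output.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```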



Mineral exploration targeting requires the compilation, integration and interrogation of diverse, multi-disciplinary data (e.g., Hronsky and Groves, 2008). In today’s world, this is commonly done using a geographic information system (GIS), the light table environment of our modern computer-age. A GIS is a powerful computer-based system designed to capture, store, manipulate, analyse, manage, and present spatial data (e.g., Groves et al., 2000) and their non-spatial attributes that can be stored in linked, interrogatable tables.

In mineral exploration targeting, as in many other fields of applied geoscience, GIS has surpassed the human ability to integrate and quantitatively analyse the ever-growing amount of geospatial data available, thereby progressively replacing the traditional working methods. Over the past two decades and driven by significant improvements in soft- and hardware capabilities, powerful GIS-based, algorithm-driven methods have been developed in support of exploration targeting. These methods fall under the umbrella term of mineral prospectivity mapping (MPM), also known as mineral prospectivity analysis, spatial predictive modelling or mineral potential modelling (e.g., Bonham-Carter, 1994, Pan and Harris, 2000, Carranza, 2008, Porwal and Kreuzer, 2010, Yousefi and Nykänen, 2017, Hronsky and Kreuzer, 2019). The resulting mineral prospectivity maps are typically generated through a combination of multiple evidential (or predictor) maps representing a set of targeting criteria (also referred to as mappable criteria or proxies), defined and combined based on measured spatial or genetic associations with the targeted mineral deposits and each other.

Despite significant methodological and computational advances over the past two decades, MPM is yet to prove effective in real world exploration (McCuaig and Hronsky, 2000, McCuaig and Hronsky, 2014, Porwal and Kreuzer, 2010, Joly et al., 2012, Hagemann et al., 2016a), particularly as regards helping to discover large, potentially economic mineral deposits. As discussed in Hronsky and Kreuzer (2019), whilst the concepts and technology behind MPM are sound, effective deployment is currently limited due to issues regarding the use of input data. The main implication of this thesis is that MPM will become an effective exploration tool once input data are processed and organised to more uniformly and objectively reflect the targeted search space and underlying targeting model.

The next important step-change required for MPM to become a more effective exploration tool is better integration of the conceptual mineral deposit model with data available to support exploration targeting. We believe this can be achieved by way of an exploration information system (EIS) that can address and handle complex natural phenomena such as mineral systems and, thereby, play a key role in exploration targeting and in the development of orebodies. Given the diversity and complexity of ore-forming processes and the tectonic environments of ore formation (e.g., McCuaig et al., 2010, Pirajno, 2016, Hagemann et al., 2016a), future implementation of an EIS would require (i) the generation of information from data (i.e., the collection and organisation of exploration and geoscience data in such a way that they become information that has additional value beyond the value of the original data themselves), (ii) knowledge-generation (e.g., from information about ore-forming processes), and (iii) the gaining of insight relevant to the development of mineral exploration targeting strategies. Hence, by converting the data to information, information to knowledge, and knowledge to insight, an EIS would facilitate problem-solving in mineral exploration targeting and provide a platform where mineral systems insight can be converted into mappable criteria and the prediction of undiscovered mineral deposits.

Here we introduce the conceptual framework for an EIS, a concept we view as the next logical step in the future development and adaptation of GIS technology for use in mineral exploration targeting and MPM. In addition, we discuss how EIS would facilitate the more effective translation of conceptual ore deposit models into real-world exploration targeting models.


Mapping Points

"Geocoding" refers to the process of identifying an individual latitude/longitude pair for an address or other location description. To actually plot a location on a map, you need the location's latitude and longitude. 219 West 40th Street means nothing without coordinates.

Geocoding is actually challenging because there aren't good, free resources for doing batch jobs, where many addresses are geocoded at once. My Geocoding Tip Sheet includes some helpful resources, but many city data sources already include coordinates, so double-check that first.

If you're committed to mapping points, you may need my help geocoding them.
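For a handful of addresses, a free geocoder such as Nominatim (via the geopy package) may be enough; this is only a sketch, the address and user_agent are placeholders, and free services are heavily rate-limited for batch work:

```python
# Geocode a single address with the free Nominatim service via geopy.
# Respect the usage policy: roughly one request per second, descriptive user_agent.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="class-mapping-project")
location = geolocator.geocode("219 West 40th Street, New York, NY")
if location is not None:
    print(location.latitude, location.longitude)
```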

Mapping Lines

No student has ever pitched me a compelling map that features lines rather than shapes or points. I did a project that drew out flight maps showing how far from home every prisoner incarcerated in Florence, CO is, but I pitched that, so it doesn't count. To draw that map I had to take a crash course in rendering lines. If you're excited about doing something like this, great! But you're going to need to install R and walk through Nathan Yau's tutorial before you do anything else.

Mapping Polygons

Zipcodes, council districts, police precincts -- these are all polygons. Most of your maps will be in polygons. These polygons are defined in (usually) one of two specialized file formats -- a "Shapefile" or a "KML" file. The syntax of the file types varies, but they contain basically the same information -- the polygon called "Bronx CB 04" is defined by this series of lat/lon pairs. My Shapefiles Tip Sheet has some excellent resources for finding shapefiles, though a lot of the resources there are New York City specific.

Often (usually) your data won't include a shapefile. If you have high school graduation rates by school district and you want to map them, you need to find a shapefile that describes the outline of each school district, and then combine that shapefile with your data by identifying a column the two tables have in common.
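Outside of QGIS, one way to do that join is with geopandas; the file names and the shared district_id column below are hypothetical:

```python
# Join tabular graduation rates onto school-district polygons by a shared key.
import geopandas as gpd
import pandas as pd

districts = gpd.read_file("school_districts.shp")   # polygons + district_id
rates = pd.read_csv("graduation_rates.csv")          # district_id + grad_rate
joined = districts.merge(rates, on="district_id", how="left")
joined.to_file("districts_with_rates.shp")
```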


Adding American Community Survey Data

About the Census

For the U.S., census data is a key dataset for understanding the health and progress of our society. It provides metrics about our society and is used to normalize other data for identifying and measuring issues in our economy, environment, and society. This tutorial will explain the proper method for querying and downloading census data, preparing the data for QGIS, and joining, analyzing, and styling the data.

For reference, the U.S. Census Bureau has two main surveys, the Decennial Census and the American Community Survey. The Decennial Census is the major census survey, carried out every 10 years, which attempts to count every person in the country. It has two major disadvantages: first, it only happens every 10 years, so in the years in between, the last census may be too outdated and the next one too far away; and second, because it does not use any sampling techniques, it often under-represents minorities.

The second main survey is the American Community Survey (ACS), which happens continuously. Its questionnaire is sent to 295,000 addresses monthly, and it gathers data on topics such as ancestry, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics. Its results come in two forms: 1-year estimates and 5-year estimates. The 1-year estimates are the most current but the least reliable. By contrast, the 5-year estimates are not as current but are much more reliable. These are what we will use for our session today.

Downloading Census Data

In the folder we had you download, we have already provided the csv files of preprocessed census data. But it’s important to know where (and how) to access it. The main Census website is data.census.gov. Here, the easiest thing to do is to search for what you’re looking for. But how do you know what you need? I always start by navigating to Census Reporter, a go-to guide for understanding what data exists and which tables contain it. On the Census Reporter Table Codes page, a simple search for School Enrollment tells us we want to look at table S1401. It even provides a link to the Census site.

On the Census website, we first want to filter by ACS 5-year. We then want to bring in the proper geography. In our case, we will visualize by counties, so click on Geos and select All US Counties. Next, we will be prompted to download the table. Select this button and uncheck the 1-year estimate. Then select Download.

And now you have the 2019 5-year estimate at the county level for the table S1401. If you open the table, you’ll quickly be overwhelmed by the sheer count of columns and rows, as well as the codes for each.

If you open the ACSST5Y2019-S1401_edited.csv file provided in your download packet, we can walk through what I did to produce this file.

So here you can see the GEOID, the County Name, and then 6 columns of census data. Three of these columns are data about the college population (Total Population Ages 18-24, Total Population Ages 18-24 Enrolled in College or Graduate School, and Percent of Population Ages 18-24 Enrolled in College or Graduate School). For each of these columns, the census also tells us the margin of error associated with the estimate. Since we’re working at the county level, the MOE is less of an issue than if we were analyzing data at the tract or block-group level.
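If you want to reproduce that preprocessing yourself, a pandas sketch along the lines below would do it; the raw file name and the S1401 column codes are placeholders, so check them against the metadata that ships with the download:

```python
# Trim the raw ACS S1401 download to the GEOID, county name and columns of interest.
import pandas as pd

# Row 1 of data.census.gov downloads holds human-readable labels, so skip it.
raw = pd.read_csv("ACSST5Y2019.S1401_data_with_overlays.csv", skiprows=[1])

cols = {
    "GEO_ID": "GEOID_full",
    "NAME": "County",
    # The S1401_CXX_XXXE codes below are placeholders; look them up in the metadata file.
    "S1401_C01_010E": "TotPop1824",
    "S1401_C01_011E": "Enrolled1824",
}
trimmed = raw[list(cols)].rename(columns=cols)
# County GEO_IDs look like "0500000US36061"; the trailing 5 digits are the county FIPS.
trimmed["GEOID"] = trimmed["GEOID_full"].str[-5:]
trimmed.to_csv("ACSST5Y2019-S1401_edited.csv", index=False)
```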

So let’s begin by importing our data.

Select Add Delimited Text Layer and navigate to CensusData/ACSST5Y2019-S1401_edited.csv. This file does not have geometry, so set the geometry definition to "No geometry (attribute only table)". Then select Add.
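For reference, the same import can be scripted from the QGIS Python console; the path is a placeholder:

```python
# Load the CSV as an attribute-only (no geometry) delimited-text layer.
uri = "file:///path/to/CensusData/ACSST5Y2019-S1401_edited.csv?delimiter=,&geomType=none"
acs_layer = QgsVectorLayer(uri, "S1401", "delimitedtext")
if acs_layer.isValid():
    QgsProject.instance().addMapLayer(acs_layer)
```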

Now, like last time, I can already anticipate an issue with the GEOID, since values in that column vary between 4 and 5 characters. So let’s fix that by navigating to the attribute table for the S1401 layer.

We will begin by creating a new field using the field calculator (the abacus icon). We will name the field ‘GEOID_Fixed’ and set the field type to string. Our expression will be the following: lpad("GEOID", 5, '0')
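If it helps to see what that expression does, here is the equivalent padding in plain Python on a couple of made-up GEOIDs:

```python
# Equivalent of the lpad() expression above: left-pad GEOIDs to 5 characters.
for geoid in ["1001", "36061"]:
    print(geoid.zfill(5))   # -> "01001", "36061"
```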

Now, let’s join our county-level data to our county shapefile. To do so, open the properties of the county layer and select ‘Joins’ from the options panel. We will join on the GEOID field of the county file and the GEOID_Fixed field of the county ACS data file, and we will provide a prefix of ‘S1401_’. Now let’s check the attribute table to make sure our join was successful.
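The same join can also be set up programmatically; the layer variables below follow on from the earlier snippet and are assumptions:

```python
# Join the ACS table onto the county polygons on GEOID == GEOID_Fixed,
# prefixing the joined fields with "S1401_".
join = QgsVectorLayerJoinInfo()
join.setJoinLayer(acs_layer)            # the CSV layer loaded above
join.setJoinFieldName("GEOID_Fixed")
join.setTargetFieldName("GEOID")        # field on the county polygon layer
join.setPrefix("S1401_")
county_layer.addJoin(join)              # county_layer: your county polygon layer
```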

So let’s visualize the data. For this exercise, we are going to make 3 maps, each helping us investigate the data we are looking at. To begin, we will look at the total population. To do this, navigate to Symbology in the Properties panel for the joined county data. Here, select Graduated as the type. For value, select TotPop1824, and for the classification mode, select Logarithmic Scale.

Now let’s bring this into the composer for editing. Begin by importing the map: select the Add Map icon and click and drag across the canvas from corner to corner. Let’s add a legend for our data, sourcing, and a title, and export as a .png. This will be our first map, so we want to save a template of this print layout and then get back to our main QGIS workspace.

Before doing anything, let’s duplicate our layer with the county data so that we don’t disturb the link to the print layout, should we want to make any future adjustments. So we will duplicate our county layer, this time visualizing the total population enrolled. Here, we will classify using Equal Count (Quantiles).

Let’s bring this into the composer, starting with the previous template so that we only need to make minimal changes!

And finally, let’s make one last map, again duplicating our county layer. This time, let’s visualize the percent field and classify it using Natural Breaks. Lastly, bring this into the print composer and produce a final graphic.
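If you are curious how these classification modes differ, the class breaks can be computed outside QGIS as well; the values below are synthetic, and a natural-breaks (Jenks) classification would need an extra package such as mapclassify:

```python
# Compare logarithmic, quantile and equal-interval class breaks for five classes.
import numpy as np

# Stand-in for skewed county population counts.
values = np.random.default_rng(1).lognormal(mean=8, sigma=1.5, size=3000)

quantile_breaks = np.quantile(values, np.linspace(0, 1, 6)[1:])        # ~equal feature count per class
equal_breaks = np.linspace(values.min(), values.max(), 6)[1:]          # equal-width classes
log_breaks = np.logspace(np.log10(values.min()), np.log10(values.max()), 6)[1:]  # logarithmic scale

print(np.round(quantile_breaks), np.round(equal_breaks), np.round(log_breaks), sep="\n")
```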


Sunday, 15 December 2019

Mapguide-react-layout dev diary part 21: Some long overdue updates and elbow grease

The previous blog series title is too long to type out, so I've shortened this blog series to be just called "mapguide-react-layout dev diary". It's much easier to type :)

So for this post, I'll be outlining some of the long-overdue updates that have been done for the next (0.13) release (and why these updates have been held off for so long).

Due to the long gap between the 0.11 and 0.12 releases, I didn't want to rock the boat with some of the disruptive changes I had in the pipeline, choosing to postpone this work until 0.12 had settled down. Now that 0.12 is mostly stable (surely 8 bug fix releases attest to that!), we can focus on the disruptive work I had originally planned, postponing the planned hiatus for this project.

Updating OpenLayers (finally!)

For the longest time, mapguide-react-layout was using OpenLayers 4.6.5. The reason we were stuck on this version was that it was the last version of OpenLayers for which I was able to automatically generate a full-API-surface TypeScript d.ts definition file from the OpenLayers sources, through a JSDoc plugin that I built for this very purpose. This d.ts file provides the "intellisense" when using OpenLayers, and type-checking so that we were actually using the OpenLayers API the way it was documented.

Up until the 4.6.5 release, this JSDoc plugin did its job very well. After this release, OpenLayers completely changed its module format for version 5.x onwards, breaking my ability to generate updated d.ts files; and without up-to-date d.ts files, I was not ready to upgrade. Given the expansiveness of the OpenLayers API surface, it was going to be a lot of work to generate this file properly for newer versions of OpenLayers.

What originally brought me to write this JSDoc plugin myself was that the TypeScript compiler supported vanilla JavaScript (through the --allowJs flag), but for the longest time this did not work in combination with the --declaration flag, which allows the TypeScript compiler to generate d.ts files from vanilla JS sources that were properly annotated with JSDoc.

When I heard that this long-standing limitation was finally going to be addressed in TypeScript 3.7, I took it as the cue to see if we could upgrade OpenLayers (now at version 6.x) and use the --allowJs + --declaration combination provided by TypeScript 3.7 to generate our own d.ts files for OpenLayers.

Sadly, it seems that the d.ts files generated through this combination still aren't quite usable, which was deflating news, and I was about to put the OL update plans on ice again until I learned of another typings effort for OpenLayers 5.x and above. As there were no other viable solutions, I decided to give these d.ts files a try. Despite lacking inline API documentation (which my JSDoc plugin was able to preserve when generating the d.ts files), these typings accurately cover most of the OpenLayers API surface, which gave me the impetus to make the full upgrade to OpenLayers 6.1.1, the latest release of OpenLayers as of this writing.

Also for the longest time, mapguide-react-layout was using Blueprint 1.x. What previously held us back from upgrading, besides dealing with the expected breaking changes and fixing our viewer as a result, was that Blueprint introduced SVG icons as a replacement for their font icons. While having SVG icons is great, having the full kitchen sink of Blueprint's SVG icons in our viewer bundle was not, as that blew our viewer bundle sizes up to unacceptable levels.

For the longest time, this had been a blocker to fully upgrading Blueprint, until I found someone suggesting a creative use of webpack's module replacement plugin to intercept the original full icon package and replace it with our own stripped-down subset. This workaround brought our viewer bundle size back to acceptable levels (i.e. only slightly larger than the 0.12.8 release). With it in place, it was finally safe to upgrade to the latest version of Blueprint, which is 3.22 as of this writing.

So we finally upgraded Blueprint, but our Blueprint-styled modal dialogs were still fixed-size things whose inability to be resized really hampered the user experience of features that spawned modal dialogs or made heavy use of them (e.g. the Aqua viewer template). Since we're on the theme of doing things that are long overdue, I decided to tackle the problem of making these dialogs resizable.

My original mental note was to check out the react-rnd library and see how hard it would be to integrate into our modal dialogs. It turns out this was actually not that hard at all! The react-rnd library is completely un-intrusive and, as a bonus, lightweight, meaning our bundle sizes weren't going to blow out significantly either.

So say hello to the updated Aqua template, with resizable modal dialogs!

Now unfortunately, we didn't win everything here. The work to update Blueprint and make these modals finally resizable broke our ability to have modal dialogs with a darkened backdrop like this:

This was due to overlay changes introduced with Blueprint. My current line of thinking is to just remove support for darkened backdrops. I don't think losing this support is such a big loss in the grand scheme of things.

Hook all of the react components

The other long-overdue item was upgrading our react-redux package. We had held on to a specific version (5.1.1) for the longest time because we had usages of its legacy context API to be able to dispatch any redux action from toolbar commands. The latest version removed this legacy context API, which meant upgrading would require us to re-architect how our toolbar component constructed its toolbar items.

We were also using the connect() API, which, combined with our class-based container components, produced something that required a lot of pointless type-checking and in some cases forced me to fall back to the any type to describe things.

It turns out that the latest version of react-redux offers a hooks-based alternative to its APIs, and having been sold on the power of hooks in React in my day job, I took this upgrade as an opportunity to convert all our class-based components over to functional ones using hooks. The results were most impressive.

Moving away from class-based container components and using the react-redux hooks API meant that we no longer needed to type state/dispatch prop interfaces for all our container components. These interfaces had to have all-optional props, as they are not required when rendering out a container component but are set when the component is connect()-ed to the redux store. This optionality infected the type system, meaning we had to do lots of pointless null checks in our container components for props that could not be null or undefined, but which we had to check anyway because our state/dispatch interfaces said so.

Using the hooks API means that state/dispatch interfaces are no longer required, as they are now implementation details of the container component through the new useDispatch and useSelector hook APIs. It means we no longer need to do a whole lot of pointless checks for null or undefined. Moving to functional components with hooks also means we no longer need the connect() API (we just default-export the functional component itself) or the "any" type band-aid.

To see some visual evidence of how much cleaner and more compact our container components are, consider one of our simplest container components, the "selected features" counter:


3. Methodology

In order to find out the requirements for the deliverables of the Working Group, use cases were collected. For the purpose of the Working Group, a use case is a story that describes challenges with respect to spatial data on the web for existing or envisaged information systems. It does not need to adhere to a particular standardised format. Use cases are primarily used as a source of requirements, but a use case could be revisited near the time the work of the Working Group reaches completion, to demonstrate that it is now possible to make the use case work.

The Working Group has derived requirements from the collected use cases. A requirement is something that needs to be achieved by one or more deliverables and is phrased as a specification of functionality. Requirements can lead to one or more tests that can prove whether the requirement is met.

Care was taken to only derive requirements that are considered to be in scope for the further work of the Working Group. The scope of the Working Group is determined by its charter. To help keep the requirements in scope, the following questions were applied:

  1. Is the requirement specifically about spatial data on the Web?
  2. Does the use case include data published, reused, and accessible via Web technologies?
  3. Does the use case have a description that can lead to a testable requirement?

5. Conclusion

The traditional hydrologic modeling approach presents a major barrier for areas that lack the necessary resources to run a model. A HMaaS approach was developed to answer the need for water information in areas lacking the resources to run their own models: a large-scale streamflow prediction system based on the ECMWF ensemble global runoff forecast. However, this new model presents a series of challenges to run in an operational environment and to make the resulting streamflow information useful at the local scale. These “hydroinformatic” challenges were divided into four categories: big data, data communication, adoption, and validation. The developed model provides a high-density result by routing runoff volume from ECMWF using the RAPID routing model. A HMaaS approach was used to address the communication challenges faced by a model covering such a large area. A cloud cyberinfrastructure was developed to host model workflows, inputs, and outputs. Web applications were deployed to expose results over the Internet. Web services such as a REST API and geospatial services were created to provide access to forecasted results. Additional web applications were created with the main goal of allowing customization and providing flexibility for local agencies to use results according to specific needs. These projects were demonstrated in different countries around the world, including Argentina, Bangladesh, Brazil, Colombia, Haiti, Peru, Nepal, Tanzania, the Dominican Republic, and the United States. We tested our results by comparing our forecasts to observed data. We determined that our model results are in essence the same as the GloFAS results, but at a higher density. We also determined that our forecasted results are usually close to observed values and are able to capture most extreme events. Finally, we analyzed the effect of density variations on our model and determined that sub-basin sizes do not significantly affect results at the mouth of the watershed.


1. Introduction

Map making and geocomputation are two essential crafts analysts need to master in order to extract knowledge from geographic data and to gain insights from data analysis using Geographic Information Systems (GIS). Yet, we know that the practice of map making as well as the analytical process are full of semantic intricacies that require a lot of training. Cartographic practice, for instance, entails a large amount of written and unwritten ‘rules’ about scales of measurement, data semantics and analytic intentions when selecting graphical symbols on a map (Müller et al. 1995 ). In a similar fashion, the application of GIS tools to construct geocomputational workflows is an art that goes largely beyond fitting data types to inputs and outputs (Hofer et al. 2017 ). In fact, meaningful analysis, i.e. the application of appropriate analytic methods to data sources of a specific origin for a given purpose (Stasch et al. 2014 , Scheider and Tomko 2016 ), requires considerable background knowledge about semantic concepts.

Consider the following example. Suppose we have one region attribute representing lake temperatures measured by environmental sensors, and another one denoting water volumes of these same lakes. Both have the same data type (polygon vector data). Suppose that for purposes of estimating hydroelectric energy potentials, we are interested in the water volume of all lakes as well as in the temperature of the water, in order to assess to what extent heating up by the power plant may lead to ecological damage downstream (Bobat 2015). For a skilled analyst, it is intuitively clear that total volume can be obtained by summing up lake volumes, whereas each measured temperature value needs to be weighted by the volume of the respective lake to arrive at a reliable estimate of the water’s average temperature. Furthermore, this analyst is likely to choose a choropleth map when visualizing lake temperature over space, but a bar chart or pie chart map for lake volume. The reason lies in the fact that intensive measures like temperature are independent of the size of their supporting object (in this case the area of the lake’s region, see Figure 1(b)), while extensive measures, such as lake volume, are additive (Figure 1(a)).
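As a minimal numerical illustration (with invented lake figures), the two aggregation rules look like this in code:

```python
# Extensive vs. intensive attributes: volumes simply add up, while temperatures
# must be weighted by the supporting volume.
lakes = [
    {"volume_m3": 2.0e6, "temp_c": 12.0},
    {"volume_m3": 0.5e6, "temp_c": 18.0},
    {"volume_m3": 7.5e6, "temp_c": 9.0},
]

total_volume = sum(l["volume_m3"] for l in lakes)                             # extensive: plain sum
mean_temp = sum(l["volume_m3"] * l["temp_c"] for l in lakes) / total_volume   # intensive: volume-weighted mean

print(total_volume, round(mean_temp, 2))   # 10,000,000 m^3 and ~10.05 °C
```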


Figure 1. Examples of extensive and intensive properties. Image (a) by kind permission of North Dakota Game and Fish department. Image (b) by open attribution license (CC BY).


We lack methods for automatic labeling of data sets and attributes with extensiveness/intensiveness. Manual labeling is seldom done in practice and does not scale with the speed of data production (Alper et al. 2015 ).

We lack methods for systematically assessing the space of meaningful geocomputational/cartographic method applications to extensive/intensive properties. In essence, we lack a theory that would allow us to explore this space in a systematic manner once data are labeled.

In this article, we address both challenges through investigating possible solutions from machine learning (ML) and geospatial semantics (Egenhofer 2002 , Janowicz et al. 2013 ). For tackling the first challenge, we test several supervised ML classification algorithms on different kinds of (geo-)statistical features extracted from region statistics data which capture the relation between areas and their attributes (Section 3). Regarding the second challenge, we review textbook knowledge about the applicability of cartographic and geocomputational methods and encode it using an algebraic model expressed in terms of an ontology design pattern (Section 4). Together with the result of geodata labeling, this pattern can be used to select kinds of GIS tools adequate for intensive (IRA) and extensive region attributes (ERA), for workflow automation and data/tool recommendation on statistical portals (compare Figure 2). We explain each method, discuss its results and give an outlook in the corresponding sections. We start with reviewing the state of the art about extensive and intensive properties.
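As a rough illustration of the first challenge, the kind of supervised labelling experiment described above could be set up as sketched below; the feature set and the random-forest classifier are stand-ins, not the exact configuration evaluated in Section 3:

```python
# Train a classifier to label region attributes as extensive (ERA) or intensive (IRA)
# from simple (geo-)statistical features, e.g. the correlation between attribute
# values and region areas. The data here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: one row per attribute, columns are hypothetical summary statistics
# such as corr(value, area), skewness, coefficient of variation.
X = np.random.default_rng(0).normal(size=(200, 3))
y = np.random.default_rng(1).integers(0, 2, size=200)   # 1 = extensive, 0 = intensive (toy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```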


Figure 2. Approach taken in this article. We suggest an operational design pattern and test approaches for labeling statistical attributes and corresponding GIS tools. Workflow automation and data/tool recommendation are considered future work.



2. Methods

This section briefly describes the Symbolic Machine Learning (SML) classifier at the heart of the GHSL image classification workflow, the input imagery and the ancillary data used for extracting the multitemporal built-up area grids. The methodological steps followed for the classification of built-up areas at each single epoch and for the production of the final multitemporal grids are also presented.

2.1. The symbolic machine learning for large-scale data analytics

The SML approach involves two main steps:

  1. Reduce the data instances to a symbolic representation, also called unique discrete data-sequences.
  2. Evaluate the association between the unique data-sequences X (input features) and the learning set Y (a known class abstraction derived from a learning set).

In the application proposed here, the data-abstraction association is evaluated by a confidence measure called the Evidence-Based Normalized Differential Index (ENDI), which takes values in the continuous range [−1, 1]. The ENDI confidence measure Φ of data instances X, given the positive (Y⁺) and negative (Y⁻) data instances from the learning set, is defined as follows:

$$\Phi(X \mid Y^{+}, Y^{-}) = \frac{f^{+} - f^{-}}{f^{+} + f^{-}} \tag{1}$$

where f⁺ and f⁻ are the frequencies of the joint occurrences between the data instances and the positive and negative learning-set instances, respectively.

To achieve a binary classification (i.e. two-class datasets: built-up vs. non-built-up surfaces), a cut-off value of Φ is automatically estimated for assigning each data-sequence to a single class. For the dataset presented here, the Otsu thresholding approach (Otsu, 1979) is used to binarize the ENDI output Φ. The Otsu method chooses an optimal threshold by minimizing the within-class variance and maximizing the between-class variance.
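A toy sketch of the ENDI measure and the Otsu cut-off (with synthetic data-sequences and learning-set masks, not the GHSL inputs) might look like this:

```python
# Illustrative ENDI confidence (Eq. 1) per symbolic data-sequence, followed by
# automatic binarisation of the ENDI values with Otsu's threshold.
import numpy as np
from skimage.filters import threshold_otsu

rng = np.random.default_rng(0)
sequences = rng.integers(0, 50, size=10_000)    # symbolic data-sequence id per pixel (synthetic)
is_builtup = rng.random(10_000) < 0.3           # learning set: positive (built-up) mask (synthetic)

phi = np.zeros(50)
for s in range(50):
    members = sequences == s
    f_pos = np.count_nonzero(members & is_builtup)    # joint occurrences with positives
    f_neg = np.count_nonzero(members & ~is_builtup)   # joint occurrences with negatives
    if f_pos + f_neg:
        phi[s] = (f_pos - f_neg) / (f_pos + f_neg)    # ENDI in [-1, 1]

cutoff = threshold_otsu(phi)                          # automatic cut-off value
builtup_sequences = np.flatnonzero(phi > cutoff)      # sequences classified as built-up
print(cutoff, builtup_sequences[:10])
```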

The SML automatically generates inferential rules linking the satellite image data to available high-abstraction semantic layers used as learning sets. In principle, any data thematically linked to, or approximating, the “built-up areas” class abstraction with an exhaustive worldwide coverage can be used for deriving human settlements information from any satellite imagery. There is no need for full a priori spatial and temporal alignment between the input imagery and the learning set, nor for calibration of the input data, as the SML learning process is computationally efficient and can be executed “on-the-fly” for every input satellite scene. Details on the SML algorithm and its suitability for processing of big earth data are provided in Pesaresi et al. (2016). In Pesaresi et al. (2015), the performance of the SML was compared to alternative supervised classification algorithms such as Maximum Likelihood, Logistic Regression, Linear Discriminant Analysis, Naive Bayes, Decision Tree, Random Forest and Support Vector Machine. According to these experiments, at parity of data conditions (same input image features and same quality or same noise level of the reference set), the SML approach generally outperformed both parametric and non-parametric classifiers. Furthermore, these better performances were obtained at a much lower computational cost. Consequently, the SML classifier was evaluated as the best available solution for large-scale processing of big earth data.

2.2. The Landsat data collections

Four Landsat collections were used as input imagery:

  1. 7,597 scenes acquired by the Multispectral Scanner (MS) (collection 1975)
  2. 7,375 scenes acquired by the Landsat 4–5 Thematic Mapper (TM) (collection 1990)
  3. 8,788 scenes acquired by the Landsat 7 Enhanced Thematic Mapper Plus (ETM+) (collection 2000)
  4. 9,442 scenes acquired by the Landsat 8 Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) (collection 2014)

The 1975, 1990, and 2000 collections use the Global Land Survey (GLS) datasets pre-processed by the University of Maryland and described in Gutman, Huang, Chander, Noojipady, and Masek (2013).

The scenes from the collection 2014 were directly downloaded from the USGS website (United States Geological Survey, 2013 ). These input data show large heterogeneity in terms of completeness but also in the sensor characteristics evolving from Landsat-1 to Landsat-8. In this data universe, large differences in the radiometric and spatial resolutions (from 15 to 60 m) of the spectral bands, as well as the noise-to-signal ratio of the data, are observed (Table 1). The data availability in the different collections is shown in Figure 1: there are large data gaps in the northern part of South America and the whole Greenland is missing in the 1975 collection. Alike, large parts of Siberia are missing in the 1990 collection. Moreover, the incomplete metadata information for 16.6% and 32.8% of scenes of the 1975 and 1990 collections, respectively, was not allowing for the estimation of the top-of-atmosphere reflectance parameters and consequently directed us towards a classification strategy based on a per-scene machine learning as proposed in this study.