Google Summer of Code 2014: Search Engine for Hidden Services

Juha Nurmi, 21.04.2014

Organization: The Tor Project and EFF
Short description: I would like to develop a search engine for hidden services. It needs a lot of love and care. I founded it, have developed and maintained it, and would like to continue doing so. I have published its source code.
Additional info:

1. What project would you like to work on? Use our ideas lists as a starting point or make up your own idea. Your proposal should include high-level descriptions of what you're going to do, with more details about the parts you expect to be tricky. Your proposal should also try to break down the project into tasks of a fairly fine granularity, and convince us you have a plan for finishing it. A timeline for what you will be doing throughout the summer is highly recommended.

I would like to work on the project Search Engine for Hidden Services.

I would like to develop the search engine as free software. I have been developing and maintaining the search engine, and it needs a lot of love and care. It makes the Tor network accessible in many different ways: listing hidden services, gathering their descriptions and providing a full-text search.

During my GSoC I plan to implement various new key features for the search engine.

Search development

Full text search development

  • Popularity tracking (catch users' clicks and tell YaCy which pages are popular): development of a popularity tracking feature and its integration with the YaCy API (providing stats for popular pages and suggestions for relevant results)
    • Using JavaScript or redirecting URLs to detect what the user clicks
    • Search -> detect click -> send information to Django -> Django sends information to the YaCy nodes -> gather popularity
    • Passing this information to the YaCy index
    • Better search results
    • Show top pages
    • 2 workweeks
  • Use another crawler to find .onion pages on the public Internet
    • Search new .onion domains from different online sources
    • Ask for help from organizations that are crawling
    • Check the backlinks from the public WWW
    • This is an excellent case to test open source crawlers like Heritrix and Apache Nutch
    • Optionally, we can replace YaCy with another crawler if we find one of them better than YaCy
    • Better search results
    • 2 workweeks
  • Public open YaCy back-end for everyone
    • Let's make our YaCy network open so that anyone can join it with their own YaCy nodes
    • This way we could get real P2P decentralization
    • The search engine is free software and the back-end YaCy network should be free to everyone; also, we will get voluntary YaCy nodes this way
    • Share an installation configuration package that joins a YaCy node to the search engine's nodes
    • 1 workweek
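
The popularity tracking flow above (search -> detect click -> gather popularity -> show top pages) could be sketched roughly as follows. This is only an illustration of the counting step under stated assumptions: the ClickTracker class and its method names are hypothetical, the counts live in memory, and the Django view and the call that passes the numbers to YaCy are left out because the proposal does not specify them.

```python
from collections import Counter
from urllib.parse import urlparse


class ClickTracker:
    """Counts result clicks per .onion page (hypothetical helper)."""

    def __init__(self):
        self.clicks = Counter()

    def record_click(self, url):
        """Record one click; ignore anything that is not an .onion URL."""
        host = urlparse(url).hostname or ""
        if not host.endswith(".onion"):
            return False
        self.clicks[url] += 1
        return True

    def top_pages(self, n=10):
        """Return the n most clicked pages as (url, clicks) pairs."""
        return self.clicks.most_common(n)


# Placeholder addresses, used only for the example.
tracker = ClickTracker()
for clicked in ["http://aaaaaaaaaaaaaaaa.onion/",
                "http://bbbbbbbbbbbbbbbb.onion/",
                "http://aaaaaaaaaaaaaaaa.onion/"]:
    tracker.record_click(clicked)

print(tracker.top_pages(1))  # [('http://aaaaaaaaaaaaaaaa.onion/', 2)]
```

In a real deployment the counts would be stored in the shared database rather than in memory, and only aggregated numbers would be forwarded to the YaCy nodes.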

Better edited HS descriptions

  • Design and development of a more useful UI with more complete descriptions and details (e.g., show the whole history of descriptions and make them easier for users to edit)
    • Requires security-conscious design
    • Show site popularity
    • Commenting features
    • Authenticated information from the hidden service itself: what it says about itself
    • Expose some of the popularity/backlink information to users, in case that lets them pick results more safely
    • 2 workweeks

Tor Browser friendly version of the search engine

  • Hidden service mirror for the search engine
    • Shared SQL database and YaCy back-end
    • Physical server in a secure and undisclosed location
    • 1 workweek

Information about hidden services and their content: Automated statistics and visualizations

  • Development of analytics features
    • By indexing the Tor network's content, the search engine can produce authoritative and exact quantitative research data about what is published through the Tor network
    • Share information about each site found: Server type, how long it has been online/offline, when it was crawled, popularity and backlinks, keywords, language...
    • Number of different types of HSs
    • RESTful JSON API that provides the data
    • 2 workweeks
  • Automated visualizations
    • It is very practical to visualize the data
    • What are these hidden services? Number of web servers, IRC servers, BitTorrent trackers, etc.
    • Word clouds: we can even cluster which hidden services are close to each other and show some connections
    • Backlinking visualization
    • I already generated some SVG pictures of the backlinking between .onion sites
    • 1 workweek
  • Show cached text versions of the pages
    • There used to be cached text versions of the pages, but I had to remove them
    • The problem is non-trivial: there are many ways to inject pictures and harmful JavaScript into the text cache
    • When I found that someone had even injected images using only the data: URL scheme (data:[<MIME-type>][;charset=<encoding>][;base64],<data>), I had to take down the text cache
    • 2 workweeks
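
One possible starting point for a safer text cache is sketched below. It is not the filter that was used before, and it is not a complete defense: it only shows the idea of keeping visible text, skipping script and style content, and stripping anything that looks like a data: URL. The names TextOnly and cached_text and the DATA_URL pattern are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for data: URLs such as data:image/png;base64,AAAA.
# It is deliberately conservative and may also strip harmless text.
DATA_URL = re.compile(r"\bdata:[^\s\"'<>]*", re.IGNORECASE)


class TextOnly(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data runs,
        # so an entity-encoded "data:" is also seen by the pattern above.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def cached_text(html):
    """Return a plain-text version of a page without tags or data: URLs."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = DATA_URL.sub("", " ".join(parser.parts))
    return " ".join(text.split())  # collapse runs of whitespace


page = ('<p>Hello <img src="data:image/png;base64,AAAA"> world</p>'
        '<script>alert(1)</script>')
print(cached_text(page))  # Hello world
```

The returned text still has to be escaped when it is rendered (Django templates do this by default), because stripping input alone does not make the output safe.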

    API development

    In addition, the search engine provides a RESTful API that lets other services use hidden service description information. Hidden services can integrate their descriptions directly into the hidden service list. The search engine knows which hidden services are online, and you can use the API to check a hidden service's online status. This API should be maintained to keep it general and simple. Furthermore, the search engine uses this API internally.

    Integration with software that uses hidden services

    • Integration with Tor2web: find new .onion domains
      • Thanks to our recent suggestion, Tor2web has implemented a feature that provides secure and anonymous statistics within a day. I want to implement automatic fetching and handling of this data
      • The search engine should fetch these statistics and add each new .onion page to its database
      • 1 workweek
    • Integration with Tor2web: Child abuse detection
      • Development of a Content Abuse Signaling feature to allow fast handling of abuse comments; I want to implement a callback API to publish this data to Tor2web nodes in real time
      • We would also like to get an automated signal from the Tor2web nodes when they ban a site, so that the search engine can also ban that site if necessary
      • A well-designed and authoritative entity may be useful for providing filtering lists. To this end, we currently maintain by hand a filter list that is already integrated with Tor2web and in use on nearly all the nodes of the Tor2web network. In collaboration with Tor2web, I want to develop an efficient and automated system to handle and share filtering information in a secure manner
      • We share only the MD5 sum of each banned domain
      • 1 workweek
    • GlobaLeaks integration
      • Currently, GlobaLeaks notifies the search engine so that it indexes new hidden services
      • GlobaLeaks is good for the reputation of the Tor network
      • The search engine could extend the visibility of GlobaLeaks in the search results
      • Together with GlobaLeaks: a RESTful API according to GlobaLeaks' needs and a UI to show information about GlobaLeaks nodes
      • 1 workweek
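
The filter-list sharing described under the Tor2web integration above (only the MD5 sum of each banned domain is shared) could be checked on the receiving side roughly like this. It is a minimal sketch: the function names are hypothetical, the domains are placeholders, and the normalization step (trimming and lowercasing before hashing) is an assumption, since the proposal does not say how domains are normalized.

```python
import hashlib


def domain_digest(domain):
    """MD5 hex digest of a normalized .onion domain (normalization assumed)."""
    normalized = domain.strip().lower().rstrip(".")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def is_banned(domain, banned_digests):
    """Check a domain against a shared set of MD5 digests."""
    return domain_digest(domain) in banned_digests


# The shared list contains digests only, never the readable addresses.
banned = {domain_digest("exampleexample12.onion")}

print(is_banned("EXAMPLEexample12.onion", banned))  # True
print(is_banned("otherservice3456.onion", banned))  # False
```

Publishing digests avoids distributing a readable list of banned addresses, while anyone who already knows a domain can still check whether it is on the list.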

    What I will have to show people at the end of the summer; a priority queue of the tasks:

    The main features that will be done during the summer:

    1. Integration with Tor2web to gather new .onion domains
    2. Child abuse detection and filtering information sharing
    3. Another crawler to search .onion links from the public Internet
    4. Backlink checking
    5. Popularity tracking

    workweeks: 7

    And most of these features:

    1. Automated statistics and JSON API
    2. Automated visualizations of the statistics
    3. Hidden service mirror for the search engine
    4. Show text cache of the pages

    workweeks: 6

    Hopefully, some of these features:

    1. Better UI for editing HS descriptions
    2. GlobaLeaks integration
    3. Public open YaCy back-end

    workweeks: 3

    In case there is a task that is much slower to implement than forecast we will re-evaluate it or move to the next task on the queue.

    2. Point us to a code sample: something good and clean to demonstrate that you know what you're doing, ideally from an existing project.

    My working search engine:
    The source code of the

    3. Why do you want to work with The Tor Project in particular?

    I would love to support human rights. I believe that human rights are important because without them life would be controlled by somebody else and people could not make decisions themselves.

    In practice, free software is one way to support human rights. In particular, the Tor Project provides the kind of free software I would love to support.

    Anonymity is an important right in order to support freedom of speech and defend human rights. I have been actively contributing to the Tor Project since 2010 by implementing the first public search engine for hidden services, by running a very fast exit relay and by maintaining a filtering list. I have significant hands-on competence with Tor and search engines.

    Moreover, I am planning to join to and launch several fast exit relays in Finland.

    4. Tell us about your experiences in free software development environments. We especially want to hear examples of how you have collaborated with others rather than just working on a project by yourself.

    As a Linux user, I have been using and supporting free software for over ten years.

    I am a contributor to the Callimachus open source project (a framework for data-driven applications based on Linked Data). Callimachus aims to make Semantic Web applications easier to create.

    I am a Fellow member of the Hermes Center for Transparency and Digital Human Rights; I have built a minimal integration API between my search engine and their software: GlobaLeaks (an open source project aimed at creating a worldwide, anonymous, censorship-resistant, distributed whistleblowing platform) and Tor2web (an open source project aiming to allow transparent Internet exposure of websites running on Tor Hidden Services).

    I was a volunteer and a lecturer at Observe, Hack, Make 2013, a five-day international hacker festival in the Netherlands. There I presented the project to other hackers.

    Also, I am a member of the OKF Finland Open Science Work Group (OKF). The OKF is a hub for community-driven activities around open science to advocate standards of openness in Finnish academia and facilitate transfer of knowledge between academic institutions and wider society. I am pushing researchers to publish their source code with proper licensing.

    5. Will you be working full-time on the project for the summer, or will you have other commitments too (a second job, classes, etc)? If you won't be available full-time, please explain, and list timing if you know them for other major deadlines (e.g. exams). Having other activities isn't a deal-breaker, but we don't want to be surprised.

    Yes, full-time.

    6. Will your project need more work and/or maintenance after the summer ends? What are the chances you will stick around and help out with that and other related projects?

    I am already maintaining the search engine and am going to continue doing so.

    7. What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project? Said another way, how much of a "manager" will you need your mentor to be?

    I will use familiar messaging systems, such as email, IRC and Jabber. I am going to publish weekly updates to the tor-dev mailing list. will be updated weekly. A weekly online meeting with the mentor is sufficient.

    I can travel to Italy to meet Globaleaks and Tor2web developers if it is necessary and helps to develop the API.

    8. What school are you attending? What year are you, and what's your major/degree/focus? If you're part of a research group, which one?

    I am a Ph.D. student at the Tampere University of Technology. My major is semantic computing. Since 07.2010, I have been working at the Department of Mathematics / Intelligent Information Systems Laboratory, first as a research assistant and then, after completing my master's degree (1.7.2013), as a project researcher and a lecturer.

    9. How can we contact you to ask you further questions? Google doesn't share your contact details with us automatically, so you should include that in your application. In addition, what's your IRC nickname? Interacting with us on IRC will help us get to know you, and help you get to know our community.

    IRC Channel: OFTC/#ahmia
    Twitter: @AhmiaNews
    OTR Fingerprint: 65FE90B9E3D7DCF29398516CC01DED21DD31256D

    10. Are you applying to other projects for GSoC and, if so, what would be your preference if you're accepted to both? Having a stated preference helps with the deduplication process and will not impact if we accept your application or not.

    This is the only project I am applying to.

    11. Is there anything else that we should know that will make us like your project more?

    This is what I would really like to do. I have spent a lot of time helping Tor. Building a search engine for the hidden services is relevant and useful for the whole community. I have a solid background in Web systems and virtual private networks. I am a teacher at the university and a software engineer; I know what I am doing! :)

    Finally, I propose a small, precisely targeted development project for the search engine, since I am already maintaining it and have independently worked with various organizations that use Tor and develop search engines. I am able to use this kind of “lean” approach as it is the working model of the related developer communities, allowing us to align our development and naturally interact with the talented individuals who effectively develop open-source systems like Tor, Tor2web, GlobaLeaks and YaCy. Furthermore, I know the developer of who offers technical insight to me.
