Import & Scraper Integration for the Laravel Event Portal

📌 Overview

The app supports several integration options for importing events:

  1. Commands - Manual, one-off imports via the Artisan CLI
  2. Queue Jobs - Asynchronous, queue-based imports
  3. Scheduler - Scheduled, recurring imports (e.g. daily)
  4. Webhooks/Events - Real-time updates from external sources

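🔧 Setup Steps

1. Install Dependencies

# For HTTP requests (external APIs): Laravel's HTTP client ships with the
# framework and only needs Guzzle, which new Laravel apps already require
composer require guzzlehttp/guzzle

# For web scraping (optional); css-selector is needed for Crawler::filter()
composer require symfony/dom-crawler symfony/css-selector symfony/http-client

# For extended logging/monitoring (optional)
composer require sentry/sentry-laravel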

2. Queue Configuration

Edit .env:

# Alternatives: redis, beanstalkd, etc.
QUEUE_CONNECTION=database

Create the queue table:

php artisan queue:table
php artisan migrate

3. Create Sources

Add Source records via a seeder or the admin interface:

// database/seeders/SourceSeeder.php

use App\Models\Source;
use Illuminate\Database\Seeder;

class SourceSeeder extends Seeder
{
    public function run()
    {
        Source::create([
            'name' => 'Stadt Dresden',
            'description' => 'Offizielle Veranstaltungen der Landeshauptstadt Dresden',
            'url' => 'https://stadt-dresden.de/veranstaltungen',
            'status' => 'active',
        ]);

        Source::create([
            'name' => 'Kulturzentrum Hellerau',
            'description' => 'Veranstaltungen des Kulturzentrums Hellerau',
            'url' => 'https://hellerau.org',
            'status' => 'active',
        ]);
    }
}

Run the seeder:

php artisan db:seed --class=SourceSeeder

👨‍💻 Usage

Option 1: Manual Import via Command

# Import all active sources (asynchronously)
php artisan events:import

# Import a single source (by ID)
php artisan events:import --source=1

# Or by name
php artisan events:import --source="Stadt Dresden"

# Run synchronously (blocking)
php artisan events:import --sync
Option 2: Programmatically in Code

// In a controller, service, or command:

use App\Jobs\ImportEventsJob;
use App\Models\Source;
use App\Services\EventImportService;

// Via the service (false = dispatch to the queue instead of running synchronously)
$importService = app(EventImportService::class);
$importService->importFromAllSources(false);

// Or dispatch the job directly; use only one of the two calls below
$source = Source::findOrFail(1);
ImportEventsJob::dispatch($source);      // asynchronous, via the queue
ImportEventsJob::dispatchSync($source);  // synchronous, in the current process

Option 3: Run a Queue Worker

A worker must be running for queued jobs to be processed:

# Development: a single worker with verbose output
php artisan queue:work --verbose

# Production: queue:work already runs as a long-lived process (the --daemon
# flag is deprecated); automatic restarts come from a process monitor
php artisan queue:work --tries=3 --timeout=120

# Use Supervisor to keep workers running permanently (production)
# See: https://laravel.com/docs/queues#supervisor-configuration

Scheduler Integration

Daily Import via the Scheduler

Edit app/Console/Kernel.php:

<?php

namespace App\Console;

use App\Jobs\ImportEventsJob;
use App\Models\Source;
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Register the commands for the application.
     */
    protected function commands()
    {
        $this->load(__DIR__.'/Commands');
        require base_path('routes/console.php');
    }

    /**
     * Define the application's command schedule.
     */
    protected function schedule(Schedule $schedule)
    {
        // ===== EVENT IMPORTS =====

        // Daily import at 03:00. Without --sync the command only queues jobs,
        // so onSuccess means "dispatched", not "all imports finished".
        $schedule->command('events:import')
            ->dailyAt('03:00')
            ->name('events.daily_import')
            ->onFailure(function () {
                \Illuminate\Support\Facades\Log::error('Daily event import failed');
            })
            ->onSuccess(function () {
                \Illuminate\Support\Facades\Log::info('Daily event import completed');
            });

        // Additionally: hourly imports (e.g. for frequently updated sources)
        $schedule->command('events:import --source="Stadt Dresden"')
            ->hourly()
            ->name('events.hourly_import_dresden');

        // ===== CLEANUP & MAINTENANCE =====

        // Mark past occurrences as completed, daily at 04:00 (nothing is deleted)
        $schedule->call(function () {
            \App\Models\EventOccurrence::where('status', 'scheduled')
                ->where('end_datetime', '<', now())
                ->update(['status' => 'completed']);
        })
        ->dailyAt('04:00')
        ->name('events.mark_completed');

        // Archive orphaned events that have no occurrences (nothing is deleted)
        $schedule->call(function () {
            \App\Models\Event::doesntHave('occurrences')
                ->where('status', 'published')
                ->where('created_at', '<', now()->subMonth())
                ->update(['status' => 'archived']);
        })
        ->weekly()
        ->name('events.cleanup_orphaned');

        // Optional: a harmless local task for testing the scheduler configuration
        if (app()->environment('local')) {
            $schedule->command('inspire')->hourly();
        }
    }

    /**
     * Get the timezone that should be used by default for scheduled events.
     */
    protected function scheduleTimezone(): string
    {
        return 'Europe/Berlin';
    }
}

Setting Up the Scheduler in Production

In production you need a cron job that calls the scheduler every minute:

# Edit the crontab
crontab -e

# Add the following line:
* * * * * cd /path/to/app && php artisan schedule:run >> /dev/null 2>&1

Or with a systemd timer (a modern alternative):

# /etc/systemd/system/laravel-scheduler.service
[Unit]
Description=Laravel Artisan Scheduler
Requires=laravel-scheduler.timer

[Service]
Type=oneshot
User=www-data
ExecStart=/usr/bin/php /path/to/app/artisan schedule:run
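
The service unit above requires laravel-scheduler.timer, which is not shown here. A minimal sketch of such a timer unit follows, assuming it should fire once per minute like the cron entry; the description text and the AccuracySec value are illustrative choices, not taken from the original setup.

```ini
# /etc/systemd/system/laravel-scheduler.timer
[Unit]
Description=Run the Laravel scheduler every minute

[Timer]
# Fire at second 0 of every minute
OnCalendar=*-*-* *:*:00
# The default accuracy is one minute, which would let runs drift
AccuracySec=1s
Unit=laravel-scheduler.service

[Install]
WantedBy=timers.target
```

After placing both files, the timer would typically be activated with systemctl daemon-reload followed by systemctl enable --now laravel-scheduler.timer.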

🔌 API Integration: Examples for External Sources

Stadt Dresden API

// In ImportEventsJob::fetchExternalEvents()

use Illuminate\Support\Facades\Http;

$response = Http::withHeaders([
    'Accept' => 'application/json',
    'User-Agent' => 'Dresden-EventPortal/1.0',
])->get('https://api.stadt-dresden.de/v1/events', [
    'limit' => 1000,
    'filter[status]' => 'published',
]);

$events = $response->json('data');

iCal Feed (e.g. from Google Calendar)

// Assumes sabre/vobject as the iCal parser (composer require sabre/vobject)
use Sabre\VObject\Reader;

$feed = file_get_contents('https://calendar.google.com/calendar/ical/.../public/basic.ics');
$calendar = Reader::read($feed);

$events = [];
foreach ($calendar->VEVENT ?? [] as $entry) {
    $events[] = [
        'external_id' => (string) $entry->UID,
        'title' => (string) $entry->SUMMARY,
        'location' => isset($entry->LOCATION) ? (string) $entry->LOCATION : 'TBD',
        'description' => isset($entry->DESCRIPTION) ? (string) $entry->DESCRIPTION : null,
        'occurrences' => [
            [
                'start_datetime' => $entry->DTSTART->getDateTime(),
                'end_datetime' => isset($entry->DTEND) ? $entry->DTEND->getDateTime() : null,
            ]
        ]
    ];
}

Web Scraping with DOM Crawler

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create();
$response = $client->request('GET', 'https://example.com/events');
$html = $response->getContent(); // throws on HTTP error responses (4xx/5xx)

$crawler = new Crawler($html);
$events = [];

// Build one array per .event-card element. text() and attr() throw when no
// element matches; text('') returns an empty string instead for optional fields.
$crawler->filter('.event-card')->each(function (Crawler $event) use (&$events) {
    $events[] = [
        'external_id' => $event->filter('[data-event-id]')->attr('data-event-id'),
        'title' => $event->filter('.event-title')->text(),
        'description' => $event->filter('.event-desc')->text(''),
        'location' => $event->filter('.event-location')->text(),
        'occurrences' => [
            [
                'start_datetime' => $event->filter('[data-date]')->attr('data-date'),
            ]
        ]
    ];
});
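
Scraped values are raw strings, so they usually need cleaning before they reach the database. The sketch below, in plain PHP 8, normalizes one record of the shape built above: it trims the ID, collapses whitespace in the title, and converts parseable start dates to a fixed format. The function name normalizeScrapedEvent, the Europe/Berlin default, and the rule to drop records without an ID or title are assumptions for illustration.

```php
<?php

/**
 * Normalize one scraped record; returns null when it cannot be imported.
 * (Hypothetical helper; field names follow the scraping example.)
 */
function normalizeScrapedEvent(array $raw): ?array
{
    $externalId = trim((string) ($raw['external_id'] ?? ''));
    // Collapse runs of whitespace (including line breaks) into single spaces
    $title = trim((string) preg_replace('/\s+/u', ' ', (string) ($raw['title'] ?? '')));

    if ($externalId === '' || $title === '') {
        return null; // without an ID the upsert key would be ambiguous
    }

    $occurrences = [];
    foreach ($raw['occurrences'] ?? [] as $occurrence) {
        $value = trim((string) ($occurrence['start_datetime'] ?? ''));
        if ($value === '') {
            continue; // an empty string would otherwise be parsed as "now"
        }
        try {
            $start = new DateTimeImmutable($value, new DateTimeZone('Europe/Berlin'));
        } catch (Exception $e) {
            continue; // skip dates that cannot be parsed
        }
        $occurrences[] = ['start_datetime' => $start->format('Y-m-d H:i:s')];
    }

    return [
        'external_id' => $externalId,
        'title' => $title,
        'occurrences' => $occurrences,
    ];
}

$normalized = normalizeScrapedEvent([
    'external_id' => ' 42 ',
    'title' => "  Jazz \n Night ",
    'occurrences' => [
        ['start_datetime' => '2025-03-14T19:30:00'],
        ['start_datetime' => 'not-a-date'],
    ],
]);
```

Whether an event with no valid occurrence should still be imported is a policy decision; this sketch keeps it and leaves the occurrence list empty.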

🔄 Upsert Logic Explained

The app uses Laravel's updateOrCreate() to handle duplicate events:

// Look up the event by (source_id, external_id)
// If it exists: update it with the new data
// If not: create a new record

$event = Event::updateOrCreate(
    [
        'source_id' => $source->id,
        'external_id' => $externalData['external_id'],
    ],
    [
        'title' => $externalData['title'],
        'description' => $externalData['description'] ?? null,
        'location' => $externalData['location'],
        // ... more fields
    ]
);

if ($event->wasRecentlyCreated) {
    // New event
} else {
    // Existing event was updated
}

Advantages:

  • Prevents duplicates (together with a unique index on [source_id, external_id])
  • Updates existing events
  • Makes repeated imports simple to handle
  • Caveat: not atomic by itself. It performs a lookup followed by an insert or update, so concurrent imports rely on the unique index (or a wrapping transaction) to stay consistent
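
To make the create-or-update semantics concrete, here is a small framework-free sketch that uses an in-memory array as a stand-in for the events table. It only illustrates the keying on (source_id, external_id); the function name and the array layout are hypothetical and this is not how Eloquent implements updateOrCreate().

```php
<?php

/**
 * In-memory stand-in for the events table, keyed by "source_id:external_id".
 * Returns true when a new record was created, false when one was updated.
 */
function upsertEvent(array &$table, int $sourceId, string $externalId, array $attributes): bool
{
    $key = $sourceId . ':' . $externalId;
    $created = !isset($table[$key]);

    // Merge new attributes over the existing row (or over a fresh row)
    $table[$key] = array_merge(
        $table[$key] ?? ['source_id' => $sourceId, 'external_id' => $externalId],
        $attributes
    );

    return $created;
}

$table = [];
$first = upsertEvent($table, 1, 'abc', ['title' => 'Jazz Night']);
$second = upsertEvent($table, 1, 'abc', ['title' => 'Jazz Night (moved)']);
$third = upsertEvent($table, 2, 'abc', ['title' => 'Other source, same ID']);
```

The same external_id from two different sources yields two rows, which is why the key has to include source_id.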

📊 Monitoring & Logging

Job Overview

# Process queued jobs with verbose output to watch them as they run
php artisan queue:work --verbose

# Follow the log for ImportEventsJob entries (e.g. failures)
tail -f storage/logs/laravel.log | grep ImportEventsJob

Custom Queue Monitor Dashboard

// Example: dashboard endpoint for pending and failed import jobs

Route::get('/admin/imports', function () {
    $failed = \Illuminate\Support\Facades\DB::table('failed_jobs')
        ->where('queue', 'default')
        ->latest('failed_at') // failed_jobs has failed_at, not created_at
        ->limit(20)
        ->get();

    $pending = \Illuminate\Support\Facades\DB::table('jobs')
        ->where('queue', 'default')
        ->count();

    return response()->json([
        'pending_jobs' => $pending,
        'failed_jobs' => $failed,
    ]);
});

🚀 Best Practices

1. Scaling with Many Events

For large numbers of events (1000+) per import:

  • Use chunking: $externalEvents->chunk(100) (requires a Collection; wrap plain arrays with collect())
  • Batch-process each chunk with upsert(), which writes the whole chunk in one insert-or-update statement
  • Disable query logging in the job

// In handle():
\Illuminate\Support\Facades\DB::disableQueryLog();

foreach ($externalEvents->chunk(100) as $chunk) {
    foreach ($chunk as $event) {
        $this->upsertEvent($event);
    }
}
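
As a framework-free illustration of the batching idea, the following sketch maps raw events to flat rows and splits them into batches of 100. The function name buildUpsertBatches and the selected columns are assumptions; the real job may carry more fields.

```php
<?php

/**
 * Map raw events to table rows and split them into batches (hypothetical helper).
 *
 * @param array<int, array<string, mixed>> $events
 * @return array<int, array<int, array<string, mixed>>>
 */
function buildUpsertBatches(array $events, int $sourceId, int $batchSize = 100): array
{
    $rows = array_map(
        static fn (array $event): array => [
            'source_id' => $sourceId,
            'external_id' => (string) $event['external_id'],
            'title' => $event['title'],
            'location' => $event['location'] ?? null,
        ],
        $events
    );

    // array_chunk() re-indexes each batch from 0
    return array_chunk($rows, $batchSize);
}

// 250 sample events result in batches of 100, 100 and 50 rows
$events = [];
for ($i = 1; $i <= 250; $i++) {
    $events[] = ['external_id' => $i, 'title' => "Event $i"];
}
$batches = buildUpsertBatches($events, 1);
```

Inside the job, each batch could then be handed to Eloquent, for example Event::upsert($batch, ['source_id', 'external_id'], ['title', 'location']), which relies on the unique index on those two columns.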

2. Error Handling & Retries

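// In ImportEventsJob: allow up to 3 attempts in total
class ImportEventsJob implements ShouldQueue
{
    public $tries = 3;

    // Wait 1 min before the 2nd attempt and 5 min before the 3rd;
    // the 15 min entry only applies if $tries is raised to 4 or more
    public $backoff = [60, 300, 900];
}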
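
How a backoff list maps to concrete waiting times is easy to misread, so here is a small plain-PHP sketch. It assumes the delay after failed attempt n is the n-th list entry and that the last entry is reused once the list runs out; the function name retryDelay is hypothetical and the code is an illustration, not Laravel's worker implementation.

```php
<?php

/**
 * Delay in seconds before the retry that follows failed attempt $attempt (1-based).
 * If the list is shorter than the attempt number, the last entry is reused.
 *
 * @param int[] $backoff
 */
function retryDelay(array $backoff, int $attempt): int
{
    if ($backoff === []) {
        return 0; // no backoff configured: retry immediately
    }

    return $backoff[$attempt - 1] ?? $backoff[array_key_last($backoff)];
}

$backoff = [60, 300, 900];
$tries = 3;

// With 3 tries there are only 2 retries, so only 2 delays are ever used
$delays = [];
for ($attempt = 1; $attempt < $tries; $attempt++) {
    $delays[] = retryDelay($backoff, $attempt);
}
```

Under these assumptions a job with three tries waits 60 and then 300 seconds, and gives up after the third failure without ever using the 900 second entry.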

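3. Rate Limiting for External APIs

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\RateLimiter;

protected function fetchExternalEvents()
{
    // attempt() returns false when the limit is exhausted,
    // otherwise it runs the callback and passes its result through
    $result = RateLimiter::attempt(
        'dresden-api-import',
        10, // max attempts per decay window
        function () {
            return Http::get('https://api.stadt-dresden.de/events')->json();
        },
        60  // decay window in seconds
    );

    // Anything that is not an array is treated as "no events this run"
    return is_array($result) ? $result : [];
}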
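
The idea behind a limit of N calls per time window can be shown without the framework. The sketch below is a minimal fixed-window limiter in plain PHP 8 with an injected clock so it can be tested deterministically. The class name FixedWindowLimiter is hypothetical, and the code illustrates the concept only; it is not how Laravel's RateLimiter is implemented.

```php
<?php

/**
 * Minimal fixed-window limiter: at most $maxAttempts calls per $decaySeconds.
 * The current time is passed in so the behaviour is easy to test.
 */
final class FixedWindowLimiter
{
    private int $windowStart = 0;
    private int $count = 0;

    public function __construct(
        private int $maxAttempts,
        private int $decaySeconds
    ) {
    }

    /** Runs $callback and returns its result, or false when the limit is hit. */
    public function attempt(callable $callback, int $now): mixed
    {
        if ($now - $this->windowStart >= $this->decaySeconds) {
            $this->windowStart = $now; // start a new window
            $this->count = 0;
        }

        if ($this->count >= $this->maxAttempts) {
            return false;
        }

        $this->count++;

        return $callback();
    }
}

$limiter = new FixedWindowLimiter(2, 60);
$results = [
    $limiter->attempt(fn () => 'ok', 1000),
    $limiter->attempt(fn () => 'ok', 1010),
    $limiter->attempt(fn () => 'ok', 1020), // third call inside the window
    $limiter->attempt(fn () => 'ok', 1060), // window expired, allowed again
];
```

Returning false rather than throwing mirrors the calling convention used in the example above, so the caller can decide whether to skip the run or release the job back to the queue.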

4. Transactions for Atomicity

use Illuminate\Support\Facades\DB;

// Inside the job: the closure must import $externalEvents via use()
DB::transaction(function () use ($externalEvents) {
    foreach ($externalEvents as $externalEvent) {
        $this->upsertEvent($externalEvent);
    }
});

🔍 Troubleshooting

Queue Jobs Are Not Being Processed

# 1. Check the queue configuration
php artisan config:show queue

# 2. Start an Artisan queue worker
php artisan queue:work

# 3. Check the failed_jobs table
php artisan queue:failed

Import Fails - External API Unreachable

// Use Http::withoutVerifying() to bypass TLS certificate errors (development only!)
Http::withoutVerifying()->get('https://...');

// Or set a custom timeout (in seconds)
Http::timeout(30)->get('https://...');

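Duplicate Key Errors

// Check the unique index (MySQL/MariaDB syntax; DB::raw() alone would not run the query):
DB::select('SHOW INDEX FROM events');

// If it is missing, add it in a migration:
Schema::table('events', function (Blueprint $table) {
    $table->unique(['source_id', 'external_id']);
});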
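
Adding the unique index fails if the table already contains duplicate (source_id, external_id) pairs, so it helps to find them first. The following plain-PHP sketch counts pairs in a list of rows; the function name findDuplicateKeys and the sample rows are hypothetical. On a real table the same check is usually done in SQL with GROUP BY and HAVING COUNT(*) > 1.

```php
<?php

/**
 * Return the (source_id, external_id) pairs that occur more than once.
 *
 * @param array<int, array{source_id: int, external_id: string}> $rows
 * @return array<string, int> map of "source_id:external_id" => occurrence count
 */
function findDuplicateKeys(array $rows): array
{
    $counts = [];
    foreach ($rows as $row) {
        $key = $row['source_id'] . ':' . $row['external_id'];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }

    // Keep only keys that appear at least twice
    return array_filter($counts, static fn (int $n): bool => $n > 1);
}

$rows = [
    ['source_id' => 1, 'external_id' => 'abc'],
    ['source_id' => 1, 'external_id' => 'abc'],
    ['source_id' => 2, 'external_id' => 'abc'],
    ['source_id' => 1, 'external_id' => 'xyz'],
];
$duplicates = findDuplicateKeys($rows);
```

Once the reported pairs are merged or removed, the migration that adds the unique index can run cleanly.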

📚 Resources